CHAPTER 0


NOTATION 

The  following  notations  and  conventions  are  used. 

Lower  case  latin  letters  denote  vectors,  functions,  and  integers. 
Upper  case  latin  letters  denote  matrices  and  index  sets. 

Greek letters denote real numbers.

Script  letters  denote  sets  and  classes. 

The  transpose  of  A is  A' . 

The ith component of x is $(x)_i$.

No  notational  distinction  will  be  made  between  row  vectors  and 
column  vectors. 

The  inner  product  of  the  row  vector  w and  the  column  vector  x 
is  denoted  by  wx. 

The special vector e of dimension n, called the unitary vector,
is defined by

$$(e)_i = 1, \quad i = 1, \ldots, n.$$

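The following special functions are defined:

a) Positive part function:

$$\alpha^{+} = \begin{cases} \alpha & \text{if } \alpha \ge 0 \\ 0 & \text{if } \alpha < 0 \end{cases}$$

b) Negative part function:

$$\alpha^{-} = \begin{cases} -\alpha & \text{if } \alpha < 0 \\ 0 & \text{if } \alpha \ge 0 \end{cases}$$

c) Sign function:

$$\operatorname{sgn}(\alpha) = \begin{cases} -1 & \text{if } \alpha < 0 \\ 0 & \text{if } \alpha = 0 \\ +1 & \text{if } \alpha > 0 \end{cases}$$

If any of these functions appears with a vector argument, the function applies
to each component, i.e.

$$(f(x))_i = f((x)_i).$$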

Similarly,  the  inequality  x > 0 requires  all  components  of  x to  be 
positive. 
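
As an illustration only, the special functions of this chapter can be sketched in a few lines of Python. The names `pos`, `neg`, `sgn`, and `componentwise` are assumptions introduced here and are not notation from the text.

```python
def pos(a):
    """Positive part: a if a >= 0, otherwise 0."""
    return a if a >= 0 else 0


def neg(a):
    """Negative part: -a if a < 0, otherwise 0."""
    return -a if a < 0 else 0


def sgn(a):
    """Sign function: -1, 0, or +1."""
    return (a > 0) - (a < 0)


def componentwise(f, x):
    """Apply a scalar function to each component of the vector x."""
    return [f(xi) for xi in x]


x = [-2.0, 0.0, 3.5]
x_pos = componentwise(pos, x)   # positive parts of each component
x_neg = componentwise(neg, x)   # negative parts of each component
x_sgn = componentwise(sgn, x)   # signs of each component
```

With these definitions every real number satisfies a = pos(a) - neg(a) and |a| = pos(a) + neg(a), which is the reason the two parts are convenient in linear programming formulations.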


CHAPTER  I 


INTRODUCTION 

1.1.  Pattern  Recognition  and  Classification 

Pattern recognition is concerned with the universal problem of identifying
the "class" of an object from examination of its attributes. A major
objective of pattern recognition is the development of machine implementable
methods of classification. For simple applications such as optical reading
of characters with a fixed type font, such methods offer great increases in
speed and accuracy relative to human processing. For more difficult problems
such as medical diagnosis or weather prediction, complex relationships in
large quantities of multi-dimensional data may not be immediately apparent
to casual observation. In such cases, algorithmic procedures implemented on
a computer can often complement and extend human recognition capabilities.

The recognition process can be divided into two phases, feature extraction
and classification. Feature extraction involves isolating the most
relevant portions of the available data and representing them in a compact,
useful form. A pattern is defined to be a finite dimensional vector
$x \in \mathbb{R}^n$. Each component of the pattern is called a feature.
Features are functions of observable data concerning the object to be
classified. The feature extraction process consists of reducing points in a
general measurement space to points in a finite dimensional pattern space.

For example, let the measurement space consist of continuous functions
$f: \mathbb{R}^1 \to \mathbb{R}^1$ on the finite interval $[\alpha_1, \alpha_2]$. This case occurs in the
analysis of electrocardiograms, electroencephalograms, various kinds of
spectra, and generally in problems where the physical data consists of
continuous waveforms. A simple set of features can be generated by sampling
the function on a uniform grid:

$$(x)_i = f(\alpha_1 + (i-1)\delta), \quad i = 1, \ldots, n$$

where

$$\delta = \frac{\alpha_2 - \alpha_1}{n - 1}.$$

Another alternative is to find an approximating function such as a polynomial
that is defined by a finite set of parameters or coefficients which can then
be used as the features. Clearly some feature sets will be better than others,
but there are few if any general purpose feature extraction methods that
yield good results for a wide variety of applications. Guesswork, intuition,
and experience with the specific problem are usually necessary to develop
a good feature set.

Classification is concerned with determining decision procedures for
assigning one of a finite number of class labels to a given pattern. The
distinction between classification and feature extraction is not sharp,
since classification itself may be a multi-stage process involving several
transformations of the original pattern space.

Here we will be concerned with pattern classification procedures based
on mathematical programming methods. Thus it is assumed that an initial
set of features is given.

1.2. Discriminants and the Two-Class Problem

Let $x \in \mathbb{R}^n$ be a pattern that belongs to one of two possible
classes, $C_1$ or $C_2$. One common form of classification rule decides

(1.2.1)
$$x \in C_1 \ \text{ if } f(x) > 0$$
$$x \in C_2 \ \text{ if } f(x) < 0$$


where $f: \mathbb{R}^n \to \mathbb{R}^1$ is called a discriminant function. The case f(x) = 0
is considered indeterminate, and an arbitrary decision may be made or the
choice may be randomized with specified probabilities. Geometrically, the
function f divides the pattern space into the disjoint regions

$$\mathcal{R}_1 = \{x : f(x) > 0\}$$
$$\mathcal{R}_2 = \{x : f(x) < 0\}$$

For many classes of functions, the equation f(x) = 0 defines a surface
that bounds these regions. In this case, f(x) is said to separate
$\mathcal{R}_1$ and $\mathcal{R}_2$.
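
The two-class decision rule can be sketched as follows. This is a minimal illustration, assuming a hypothetical discriminant `f` chosen only for the example; the function name `classify` and the choice to return `None` in the indeterminate case are likewise assumptions, since the text allows either an arbitrary or a randomized decision there.

```python
def classify(f, x):
    """Apply the sign rule: class 1 if f(x) > 0, class 2 if f(x) < 0.

    Returns None when f(x) == 0, the case the text calls indeterminate.
    """
    value = f(x)
    if value > 0:
        return 1
    if value < 0:
        return 2
    return None


# Hypothetical discriminant used only for illustration: f(x) = x1 + x2 - 1.
def f(x):
    return x[0] + x[1] - 1.0


label_a = classify(f, (2.0, 0.5))   # f = 1.5, positive side
label_b = classify(f, (0.0, 0.0))   # f = -1.0, negative side
label_c = classify(f, (0.5, 0.5))   # f = 0.0, on the boundary surface
```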

There are several types of classification problems, each with its
own solution philosophy. Two of these problems form the basis for much
of the discussion here. The first, or template-matching problem, is
characterized by two given finite sets of prototype pattern vectors $x_j$,
one set for each class. Let

$$\mathcal{S}_1 = \{x_1, \ldots, x_k\}$$
$$\mathcal{S}_2 = \{x_{k+1}, \ldots, x_m\}$$

be the prototype patterns for classes $C_1$ and $C_2$, respectively. Each
observed pattern from a given class can be identified with one of the
prototype patterns from that class, differing from the prototype by a
relatively small displacement vector d. The displacement vector can be
thought of as a random error associated with the physical measurement process
or as a statistical variation in the pattern population itself. Thus

(1.2.2)
$$C_i = \{x + d : x \in \mathcal{S}_i,\ d \in D\}, \quad i = 1, 2$$

where D is the set of possible displacement vectors.

The prototype sets $\mathcal{S}_1$, $\mathcal{S}_2$ are often called design or training
sets. One general solution procedure for this problem involves assuming
a parametric functional form f(x;p) for the discriminant, where p is
the parameter vector. The vector p is chosen so that $f(x_i;p) > 0$ for
$x_i \in \mathcal{S}_1$ and $f(x_i;p) < 0$ for $x_i \in \mathcal{S}_2$, if possible, i.e. by solving the
inequality system

(1.2.3)
$$f(x_i;p) > 0, \quad i = 1, \ldots, k$$
$$f(x_i;p) < 0, \quad i = k+1, \ldots, m$$

A feasible solution to (1.2.3) defines a discriminant that correctly
classifies the design sets $\mathcal{S}_1$ and $\mathcal{S}_2$. If the system is feasible,
the sets $\mathcal{S}_1$, $\mathcal{S}_2$ are said to be separable over the assumed parametric
functional form. If the functional form f(x;p) is continuous and the
set of displacement vectors D is bounded by sufficiently small bounds,
then the discriminant defined by this procedure will also separate the
complete pattern classes $C_1$ and $C_2$.

Desirable properties of a discriminant function for the template-matching
problem are errorless performance on the design sets $\mathcal{S}_1$, $\mathcal{S}_2$
and separation of $C_1$ and $C_2$ for the largest possible set of displacement
vectors. Let $D = \{d : \|d\| \le \alpha\}$ for some vector norm $\|\cdot\|$. Then the
template-matching problem for this vector norm is defined as

(1.2.4)
$$\begin{aligned} \max\ & \alpha \\ \text{s.t.}\ & f(x;p) > 0 \quad \forall x \in C_1 \\ & f(x;p) < 0 \quad \forall x \in C_2 \end{aligned}$$

where

$$C_i = \{x + d : x \in \mathcal{S}_i,\ d \in D\}, \quad i = 1, 2.$$

(See (1.2.2).)

A common choice of functional form is the linear discriminant

$$f(x) = w \cdot x - \theta$$

This case is quite general since any discriminant of the form

$$f(x) = \sum_{i=1}^{s} \alpha_i f_i(x) - \theta$$

is linear with respect to the transformed pattern $y \in \mathbb{R}^s$ defined by

$$(y)_i = f_i(x), \quad i = 1, \ldots, s.$$

Thus  techniques  developed  for  generation  of  linear  discriminants  are  also 
applicable  to  all  functions  f(x;p)  that  are  linear  in  the  parameter 
vector  p,  e.g.  polynomials  of  all  degrees  in  the  components  of  x. 
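
The reduction to a linear discriminant in a transformed pattern space can be sketched as below. The quadratic feature map `phi`, the helper `linear_discriminant`, and the particular weights are hypothetical choices made only for this illustration; they show a circular decision boundary in the original space expressed as a linear function of the transformed pattern y.

```python
def phi(x):
    """Hypothetical feature map: all monomials of degree 1 and 2 in (x1, x2)."""
    x1, x2 = x
    return [x1, x2, x1 * x1, x1 * x2, x2 * x2]


def linear_discriminant(w, theta, y):
    """Evaluate f(y) = w.y - theta for a (transformed) pattern y."""
    return sum(wi * yi for wi, yi in zip(w, y)) - theta


# Weights chosen so that f(x) = 1 - x1^2 - x2^2, i.e. the unit circle is the
# separating surface in the original space but a hyperplane in y-space.
w = [0.0, 0.0, -1.0, 0.0, -1.0]
theta = -1.0

inside = linear_discriminant(w, theta, phi((0.0, 0.0)))    # 1.0, positive side
outside = linear_discriminant(w, theta, phi((2.0, 0.0)))   # -3.0, negative side
```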

The template-matching problem (1.2.4) for linear discriminants is
discussed in the next chapter.

Template matching problems arise in relatively simple, well-defined
contexts such as optical recognition of characters printed in a fixed type
font, where pattern variation is very limited. More complex problems,
such as medical diagnosis, often involve patterns that do not always fall
into close groupings around prototypes. In this second kind of problem,
observed patterns are considered as random samples from classes having
different probability distributions. If the class distributions overlap,
then an errorless classification scheme for the complete classes $C_1$ and
$C_2$ is, of course, impossible. A discriminant is sought that minimizes some
loss criterion such as the probability of misclassification.

This problem also involves two training sets $\mathcal{S}_1$, $\mathcal{S}_2$ consisting
of examples of patterns from the respective classes $C_1$, $C_2$. A discriminant
f(x;p) is sought that performs well, although not necessarily perfectly,
on the training sets. If these sets are large and representative of
their respective source distributions, then such a discriminant should
perform well on these distributions. Some specific models and results for
this type of problem are discussed in Chapter 4.

1.3.  Outline  of  Presentation 

Chapter  2 deals  with  classification  problems  for  which  linear 
discriminants  can  be  found  that  separate  the  two  design  sets.  Mathematical 
programming  methods  for  determining  these  discriminants  are  discussed  and 
reliability  interpretations  are  made  for  a class  of  template-matching  problems. 
An  application  to  a set  of  adaptive  pattern  classification  machines  is 
given. 



In  Chapter  3 the  least  positive  deviations  solution  concept  for  a 
possibly  infeasible  system  of  linear  inequalities  is  defined.  Connections 
with  linear  programming  are  established  and  a very  efficient  algorithm 
based  on  an  unusual  pivoting  rule  is  developed  for  determining  this 
solution.  Application  of  the  algorithm  is  extended  to  a sequence  of  problems 
of  which  the  most  general  is  the  general  linear  programming  problem. 

In Chapter 4 this solution concept is applied to linearly inseparable
classification  problems.  Large  sample  solution  characterizations  are 
obtained  for  design  sets  consisting  of  random  samples  from  overlapping 
source  distributions.  Several  alternative  approaches  to  this  problem  are 
discussed  and  some  numerical  results  utilizing  the  algorithm  of  Chapter  3 
are  presented. 

Chapter  5 extends  these  methods  to  piecewise  linear  discriminants. 

A transformation  of  the  pattern  space  is  defined  that  renders  any  pair 
of  finite,  disjoint  design  sets  separable  by  a convex  piecewise  linear 
function.  An  algorithm  is  presented  that  constructs  such  a function  by 
solving  a sequence  of  linear  programs  of  a type  directly  suitable  for 
application of the least positive deviations algorithm. Results for a
sample problem are reported.


CHAPTER 2


LINEAR  SEPARABLE  CLASSIFICATION  PROBLEMS 


2.1.  Linear  Separability 

Let $\mathcal{S}_1 = \{x_1, \ldots, x_k\}$, $\mathcal{S}_2 = \{x_{k+1}, \ldots, x_m\}$ be finite,
disjoint, nonempty sets of n-dimensional patterns from classes $C_1$ and
$C_2$, respectively. These sets are defined to be linearly separable if
there exists a linear discriminant $f(x) = wx - \theta$ such that

$$f(x) > 0 \quad \forall x \in \mathcal{S}_1$$
$$f(x) < 0 \quad \forall x \in \mathcal{S}_2.$$

The vector w is called the weight vector and the real number $\theta$ is
called the threshold for reasons to be described in the next section.
Geometrically, $\mathcal{S}_1$ and $\mathcal{S}_2$ are linearly separable if, as illustrated in
Figure (2.1.1), there exists a separating hyperplane $wx = \theta$ such that
all patterns in $\mathcal{S}_1$ lie in one half-space and all patterns in $\mathcal{S}_2$
lie in the other.

For each pattern $x \in \mathbb{R}^n$, a corresponding signed augmented pattern
$a \in \mathbb{R}^{n+1}$ is defined by

(2.1.2)
$$a = \begin{cases} (x, -1) & \text{if } x \in \mathcal{S}_1 \\ (-x, +1) & \text{if } x \in \mathcal{S}_2 \end{cases}$$

Figure (2.1.1). Linear Separable Pattern Sets. (Legend: one marker denotes
class $C_1$ patterns, the other class $C_2$ patterns.)



The signed augmented pattern matrix $A \in \mathbb{R}^{m \times (n+1)}$ is defined by

(2.1.3)
$$A = \begin{bmatrix} X_1 & -e_1 \\ -X_2 & e_2 \end{bmatrix}$$

where $X_1$ and $X_2$ are matrices whose rows are the patterns (row vectors)
in $\mathcal{S}_1$ and $\mathcal{S}_2$, respectively, and $e_1$ and $e_2$ are unitary column
vectors.

PROPOSITION (2.1.4). $\mathcal{S}_1$, $\mathcal{S}_2$ are linearly separable iff the inequality system

(2.1.5)
$$Au > 0, \quad u \in \mathbb{R}^{n+1}$$

is feasible.

Proof. The proof follows immediately from the identification $u = (w, \theta)$. □
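
The construction of the signed augmented pattern matrix and the feasibility test Au > 0 can be sketched as follows. The function names and the small two-dimensional data set are assumptions made for illustration; the test only checks a candidate u and does not search for one.

```python
def signed_augmented_matrix(s1, s2):
    """Rows (x, -1) for patterns in the first set, (-x, +1) for the second."""
    rows = [list(x) + [-1.0] for x in s1]
    rows += [[-xi for xi in x] + [1.0] for x in s2]
    return rows


def matvec(a, u):
    """Matrix-vector product A u, with A given as a list of rows."""
    return [sum(aij * uj for aij, uj in zip(row, u)) for row in a]


def separates(a, w, theta):
    """True if u = (w, theta) satisfies Au > 0 componentwise."""
    return all(v > 0 for v in matvec(a, list(w) + [theta]))


# Hypothetical design sets used only for this sketch.
s1 = [(2.0, 2.0), (3.0, 1.0)]
s2 = [(0.0, 0.0), (1.0, -1.0)]
A = signed_augmented_matrix(s1, s2)

ok = separates(A, (1.0, 1.0), 2.0)    # hyperplane x1 + x2 = 2 separates the sets
bad = separates(A, (1.0, 1.0), 5.0)   # threshold too large: first set misclassified
```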

Clearly the system (2.1.5) is feasible iff the system

(2.1.6)
$$Au \ge e, \quad u = (w, \theta) \in \mathbb{R}^{n+1}$$

is feasible, since any solution of (2.1.5) can be multiplied by a
sufficiently large positive scalar. System (2.1.6) will serve as the
constraint set in several of the mathematical programming models discussed
below. Application of the following version of the Farkas lemma provides a
geometric criterion for linear separability.

LEMMA (Farkas). The inequality system

$$Au \ge b, \quad u \in \mathbb{R}^n$$

is feasible iff the dual system

$$A'y = 0$$
$$b \cdot y > 0$$
$$y \ge 0, \quad y \in \mathbb{R}^m$$

is infeasible.


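For

$$A = \begin{bmatrix} X_1 & -e_1 \\ -X_2 & e_2 \end{bmatrix}$$

and b = e, the dual system is

(2.1.7)
$$X_1' y_1 - X_2' y_2 = 0$$
$$-e_1 \cdot y_1 + e_2 \cdot y_2 = 0$$
$$e_1 \cdot y_1 + e_2 \cdot y_2 > 0$$
$$(y_1, y_2) \ge 0, \quad y_1 \in \mathbb{R}^k, \quad y_2 \in \mathbb{R}^{m-k}$$

Since the system is homogeneous and $e_1 \cdot y_1 = e_2 \cdot y_2 \ne 0$, any feasible solution
$(y_1, y_2)$ can be scaled so that $e_1 \cdot y_1 = e_2 \cdot y_2 = 1$. Then $X_1' y_1$ and
$X_2' y_2$ are points in the convex hulls of $\mathcal{S}_1$ and $\mathcal{S}_2$, respectively.

Thus the Farkas lemma stated geometrically says:

PROPOSITION (2.1.8). $\mathcal{S}_1$, $\mathcal{S}_2$ are linearly separable iff their respective
convex hulls do not intersect.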

2.2.  Threshold  Logic  Units  and  Adaptive  Machines 

A device designed to implement a linear discriminant function is
shown schematically in Figure (2.2.1). The device is called a threshold
logic unit (TLU) and has aroused considerable interest as a simple
mathematical model of a neuron (e.g. [1], [2], [3]). A TLU has n input
terminals, one for each pattern component. Each pattern component $(x)_i$
is multiplied by an adjustable internal weight $(w)_i$. The results are
summed and compared to an adjustable threshold $\theta$. An output of +1 is
made if the sum equals or exceeds $\theta$; otherwise the output is -1.
(In the neuron model, the +1 output corresponds to the "firing" of a
neuron in the presence of certain stimuli. The -1 output represents
the normal, inactive state.)

Figure (2.2.1). Threshold Logic Unit for Implementing the Discriminant
$f(x) = wx - \theta$.

The choice of the weight vector w and the threshold $\theta$ determines
the patterns or stimuli that activate the TLU. Because the weights and
threshold are adjustable, the TLU can be regarded as "trainable" and
various adaptive algorithms have been devised for training. In particular,
error correction procedures have been investigated as training methods
based on the following general scheme. Patterns are selected from $\mathcal{S}_1$
and $\mathcal{S}_2$ in some prescribed manner and presented to the TLU for
classification. If a pattern is correctly classified, no corrective action is
taken. If, however, the classification is incorrect, the weights and
threshold are adjusted in a manner tending to correct the error.

A classic example is the Perceptron error correction procedure
due to Rosenblatt [2]:

(2.2.2) Step 1. Set $u_1 = (w_1, \theta_1)$ to an arbitrary vector.
                Set k = 1. Go to Step 2.

        Step 2. Stop if $u_k$ defines a separating hyperplane. Otherwise
                select any pattern $x \in \mathcal{S}_1 \cup \mathcal{S}_2$ which is
                incorrectly classified by $u_k$. Let a be the
                corresponding signed augmented pattern, so $u_k a \le 0$.
                Go to Step 3.

        Step 3. Set $u_{k+1} = u_k + a$. Increment k by 1 and go to Step 2.

It can be shown (Novikoff [4]) that if $\mathcal{S}_1$, $\mathcal{S}_2$ are linearly separable,
then the algorithm converges in a finite number of steps to a separating
u*. Numerous variants to the procedure exist and a summary of error
correction procedures for solution of the system Au > 0 is presented by
Duda and Hart [5]. One major drawback to this class of methods is that
they are generally ineffective in the linearly inseparable case in that no
determination of linear inseparability is made in a finite number of steps.
This deficiency is corrected in the mathematical programming methods
discussed later in this chapter.
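
The three steps of procedure (2.2.2) can be sketched as follows, operating directly on the rows of the signed augmented pattern matrix. The iteration cap `max_steps` is an assumption added here so that the sketch terminates on inseparable data; the procedure as stated in the text has no such limit, and returning `None` when the cap is reached is not a proof of inseparability.

```python
def dot(u, a):
    """Inner product of two vectors of equal length."""
    return sum(ui * ai for ui, ai in zip(u, a))


def perceptron(rows, max_steps=10000):
    """Error correction on signed augmented patterns (rows of A).

    Starts from u = 0 and adds any row a with u.a <= 0 until Au > 0.
    Returns u on success, or None if max_steps corrections were not enough.
    """
    u = [0.0] * len(rows[0])
    for _ in range(max_steps):
        wrong = next((a for a in rows if dot(u, a) <= 0), None)
        if wrong is None:
            return u                                   # Step 2: u separates
        u = [ui + ai for ui, ai in zip(u, wrong)]      # Step 3: u <- u + a
    return None


# Signed augmented patterns for the hypothetical sets
# {(2, 2), (3, 1)} and {(0, 0), (1, -1)}.
A = [[2.0, 2.0, -1.0], [3.0, 1.0, -1.0], [0.0, 0.0, 1.0], [-1.0, 1.0, 1.0]]
u = perceptron(A)

# The same pattern placed in both sets cannot be separated.
A_bad = [[0.0, -1.0], [0.0, 1.0]]
u_bad = perceptron(A_bad, max_steps=50)
```

The second call illustrates the drawback noted above: on inseparable data the corrections cycle indefinitely, and only the artificial cap stops the loop.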

More elaborate learning devices called Perceptrons can be
constructed by assembling TLUs into layered networks such as that shown in
Figure  (2.2.3).  Each  TLU  in  the  first,  outer  layer  computes  a binary  function 
of  the  pattern  vector.  Subsequent  inner  layers  perform  Boolean  operations 
on  these  binary  functions.  The  innermost  layer  is  a single  TLU  which  makes 
the  decision.  The  overall  discriminant  function  implemented  by  such  a 
network  is  piecewise  linear,  and  with  a sufficiently  large  number  of  TLUs, 
any  two  finite,  disjoint  pattern  sets  can  be  separated.  Unfortunately, 
there  is  no  known  error-correction  training  algorithm  analogous  to  (2.2.2) 
that  is  guaranteed  to  converge  to  a piecewise  linear  function  capable  of 
such  a separation.  Training  is  usually  confined  to  the  innermost  TLU 
with  the  remaining  weights  being  selected  by  heuristic  or  even  random 
procedures.  In  Section  (2.3)  a linear  programming  procedure  is  presented 
for  determining  the  weights  and  threshold  in  the  inner  layer  that  maximizes 
the  reliability  of  a two-layer  Perceptron  when  the  outer  layer  TLUs  are 
subject  to  failure.  Also,  in  Chapter  5 mathematical  programming  methods 
are  presented  that  determine  a separating  piecewise  linear  discriminant 
for  general  finite  disjoint  pattern  sets. 
2.3. Maximum Quality Programs

Let $\hat{u} = (\hat{w}, \hat{\theta})$ solve Au > 0. Since the system is homogeneous,
$\lambda \hat{u}$ is also a solution for any $\lambda > 0$; i.e. the underlying separating
hyperplane $\lambda \hat{w} \cdot x = \lambda \hat{\theta}$ is invariant to the choice of the scale factor $\lambda$.
It is convenient to scale $\hat{u}$ so that $\|\hat{w}\| = 1$, where $\|\cdot\|$ is a vector
norm.

Grinold [6] defines the quality of the separating hyperplane
corresponding to $\hat{u}$ as

(2.3.1)
$$Q(\hat{u}) = \min_{i = 1, \ldots, m} \{ (A\hat{u})_i \}$$
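
The quality measure (2.3.1) amounts to the smallest component of Au. A minimal sketch follows; the function name `quality` and the small matrix are assumptions chosen for illustration.

```python
def quality(rows, u):
    """Q(u): the smallest component of A u, with A given as a list of rows."""
    return min(sum(aij * uj for aij, uj in zip(row, u)) for row in rows)


# Signed augmented patterns for the hypothetical sets
# {(2, 2), (3, 1)} and {(0, 0), (1, -1)}.
A = [[2.0, 2.0, -1.0], [3.0, 1.0, -1.0], [0.0, 0.0, 1.0], [-1.0, 1.0, 1.0]]

q1 = quality(A, [1.0, 1.0, 2.0])   # every component of Au equals 2
q2 = quality(A, [2.0, 2.0, 4.0])   # scaling u by 2 scales the quality by 2
q3 = quality(A, [1.0, 1.0, 5.0])   # negative: u does not separate the sets
```

Because Q scales with u, the quality is only meaningful once u has been normalized, which is why the programs of this section bound the norm of w.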

He observes that many of the mathematical programming models for pattern
classification that have been suggested in the literature are of the
general form

$$\begin{aligned} \max\ & Q(u) \\ \text{s.t.}\ & u \in U \end{aligned}$$

where U is some subset of feasible solutions to Au > 0 on which Q(u)
is bounded. The set

$$U = \{u : u = (w, \theta),\ Au > 0,\ \|w\| = 1\}$$

is a common choice that results in the mathematical program

(2.3.2)
$$\begin{aligned} \max\ & \lambda \\ \text{s.t.}\ & Au - \lambda e \ge 0 \\ & \|w\| \le 1 \\ & u = (w, \theta) \in \mathbb{R}^{n+1} \end{aligned}$$

Let $\lambda^*$ be the optimal objective value for (2.3.2). There are
three possible cases.

Case 1: $\lambda^* = 0$. This corresponds to the optimal solution u* = 0. It
follows from Proposition (2.1.4) that $\mathcal{S}_1$, $\mathcal{S}_2$ are linearly
inseparable.

Case 2: $0 < \lambda^* < \infty$. An optimal solution $u^* = (w^*, \theta^*)$ defines a separating
hyperplane $w^* \cdot x = \theta^*$. The constraint $\|w^*\| \le 1$ must be tight;
otherwise $\hat{u} = u^*/\|w^*\|$ is a better solution. Also, at least one
of the constraints

$$Au^* - \lambda^* e \ge 0$$

is tight; otherwise $\lambda = \lambda^*$ may be increased while maintaining
u = u*. Hence $\lambda^* = Q(u^*)$.

Case 3: $\lambda^* = +\infty$. If $\mathcal{S}_1$ is empty, then $A = [-X_2 \mid e_2]$ and $u = (w, \theta)$
is a feasible solution for all sufficiently large values of $\theta$
and any w satisfying $\|w\| \le 1$. Hence $\lambda$ is unbounded in this
case and similarly in the case where $\mathcal{S}_2$ is empty and
$A = [X_1 \mid -e_1]$. If neither set is empty, then there exist at
least two constraints of the form

$$w \cdot x_i - \theta \ge \lambda, \quad x_i \in \mathcal{S}_1$$
$$-w \cdot x_j + \theta \ge \lambda, \quad x_j \in \mathcal{S}_2$$

which imply

$$\lambda \le \tfrac{1}{2}\, w \cdot (x_i - x_j).$$

Hence $\lambda$ must be bounded since $\|w\| \le 1$ and the sets $\mathcal{S}_1$ and $\mathcal{S}_2$
are finite. Thus this case is eliminated by assuming neither sample set
is empty.

These results are summarized in the following proposition:

PROPOSITION (2.3.3). Let $\mathcal{S}_1$, $\mathcal{S}_2$ be finite, non-empty pattern sets.
Let $u^* = (w^*, \theta^*)$ be an optimal solution to (2.3.2) with objective value
$\lambda^*$. Then $\mathcal{S}_1$, $\mathcal{S}_2$ are linearly separable iff $\lambda^* > 0$, and in the
separable case $\|w^*\| = 1$ and $Q(u^*) = \lambda^*$.

If $\lambda^* > 0$, there is an equivalent form of (2.3.2):

(2.3.4)
$$\begin{aligned} \min\ & \|w\| \\ \text{s.t.}\ & Au \ge e \\ & u = (w, \theta) \in \mathbb{R}^{n+1} \end{aligned}$$

The equivalence of (2.3.2) and (2.3.4) in the linearly separable case can
be demonstrated by rewriting (2.3.4) as

$$\begin{aligned} \max\ & \frac{1}{\|w\|} = \lambda \\ \text{s.t.}\ & A\,\frac{u}{\|w\|} - \lambda e \ge 0 \\ & u = (w, \theta) \in \mathbb{R}^{n+1} \end{aligned}$$

Thus if $u^* = (w^*, \theta^*)$ solves (2.3.4), then $u^*/\|w^*\|$ solves (2.3.2)
with $\max \lambda = 1/\|w^*\|$. Conversely, if $u^* = (w^*, \theta^*)$ solves (2.3.2)
with $\lambda^* > 0$, then $u^*/\lambda^*$ solves (2.3.4) with $\min \|w\| = 1/\lambda^*$.

The programs (2.3.2) and (2.3.4) will be called the primary and
alternative forms, respectively, of the maximum quality problem. Several
different mathematical programming methods become applicable with the
choice of the specific vector norm. Let

$$\|w\|_p = \Big( \sum_{j=1}^{n} |(w)_j|^p \Big)^{1/p}$$

denote the $\ell_p$ norm for $1 \le p \le \infty$, where

$$\|w\|_\infty = \max_{j = 1, \ldots, n} \{ |(w)_j| \}.$$

The $\ell_1$, $\ell_2$, and $\ell_\infty$ norms are of particular interest since the maximum
quality problem can be formulated as a linear program in the $\ell_1$ and
$\ell_\infty$ cases and as a quadratic program in the $\ell_2$ case.

The $\ell_\infty$ norm case leads to the following linear program for the
primary form of the maximum quality problem:

(2.3.5)
$$\begin{aligned} \max\ & \lambda \\ \text{s.t.}\ & Au - \lambda e \ge 0 \\ & -1 \le (w)_j \le 1, \quad j = 1, \ldots, n \\ & u = (w, \theta) \in \mathbb{R}^{n+1} \end{aligned}$$

This is a variation on a model originally proposed by Mangasarian [7].

System (2.3.5) has the following reliability interpretation for
the two-layer Perceptron shown in Figure (2.2.3). In this TLU network,
the first layer consists of k TLUs whose combined output forms a
transformed pattern $y \in \mathbb{R}^k$ for each input pattern $x \in \mathbb{R}^n$. The single
second layer TLU then classifies y. The network is defined to be redundant
if no final classification change results when the output of an arbitrary
single TLU in the first layer changes from +1 to -1 or from -1 to +1.
Thus a redundant Perceptron remains reliable with respect to any single
TLU failure in the first layer.

Since the change induced by a failed TLU in the corresponding
component of the transformed pattern y is of fixed magnitude, namely
2, the discriminant function $f(y) = wy - \theta$ implemented by the inner
layer TLU will not change sign if it is of sufficiently high quality.

This is made explicit by the following proposition.

PROPOSITION (2.3.6). Let the set of transformed patterns be linearly
separable by the hyperplane $wy = \theta$, where $\|w\|_\infty = 1$. Then the Perceptron
is redundant if $Q(u) = Q(w, \theta) > 2$.

Proof. Failure of the jth first layer TLU changes a transformed pattern
y to y', where

$$|(y')_j - (y)_j| = 2.$$

Let $f(y) = wy - \theta$. Then

$$|f(y') - f(y)| = 2\,|(w)_j| \le 2, \quad \text{since } \|w\|_\infty = 1.$$

But $Q(w, \theta) > 2$ implies $|f(y)| > 2$. Hence f(y) and f(y') must be
both either strictly positive or strictly negative and therefore no
classification change occurs. □


If $Q(w, \theta) \le 2$, the network may not be redundant. Since $|(w)_j| = 1$
for at least one component j, a failure in the corresponding TLU implies

$$|f(y') - f(y)| = 2.$$

If, for example, $f(y) = -Q(w, \theta)$ and $f(y') = 2 + f(y)$, then

$$Q(w, \theta) \le 2 \ \Rightarrow\ \begin{cases} f(y) < 0 \\ f(y') \ge 0 \end{cases}$$

Hence the inner layer TLU output changes from -1 to +1.

If $Q(w, \theta) > 2s$, where s is any positive integer, then by the
argument used for Proposition (2.3.6), the Perceptron is redundant with
respect to simultaneous failure of any s TLUs in the outer layer. Thus
$Q(w, \theta)$ is an index of reliability in the sense described above and the
maximum quality program (2.3.5) is a natural choice for determining the
weights and threshold of the inner TLU when redundancy is a prime
consideration.
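
The redundancy criterion can be checked by brute force on a small example. The sketch below assumes first layer outputs in {-1, +1}, models a failure as a sign flip of one component, and uses hypothetical names (`tlu`, `is_redundant`, `tolerated_failures`) and data that do not come from the text.

```python
import math
from itertools import combinations


def tlu(w, theta, y):
    """Inner TLU output: +1 if w.y >= theta, otherwise -1."""
    return 1 if sum(wi * yi for wi, yi in zip(w, y)) >= theta else -1


def is_redundant(w, theta, patterns, s=1):
    """True if flipping any 1..s components of any pattern keeps the output."""
    for y in patterns:
        base = tlu(w, theta, y)
        for r in range(1, s + 1):
            for idx in combinations(range(len(y)), r):
                flipped = [-v if j in idx else v for j, v in enumerate(y)]
                if tlu(w, theta, flipped) != base:
                    return False
    return True


def tolerated_failures(q):
    """Largest integer s >= 0 with q > 2s, following the criterion Q > 2s."""
    return max(math.ceil(q / 2.0) - 1, 0)


# Hypothetical transformed patterns and inner-layer weights, max-norm of w = 1.
w, theta = [1.0, 1.0, 1.0], 0.0
patterns = [[1, 1, 1], [-1, -1, -1]]      # f(y) = 3 and f(y) = -3, so Q = 3

single = is_redundant(w, theta, patterns, s=1)   # Q = 3 > 2: one failure is safe
double = is_redundant(w, theta, patterns, s=2)   # Q = 3 <= 4: two failures are not
```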

A second formulation for solution of the linear separability problem
is suggested by Ibaraki and Maroga [8]:

(2.3.7)
$$\begin{aligned} \min\ & \sum_{j=1}^{n} |(w)_j| \\ \text{s.t.}\ & Au \ge e \\ & u = (w, \theta) \in \mathbb{R}^{n+1} \end{aligned}$$

This is the alternative form of the maximum quality problem with the
$\ell_1$ norm. Since the objective function is convex, piecewise linear, and
separable, (2.3.7) has the linear programming equivalent

(2.3.8)
$$\begin{aligned} \min\ & e \cdot w^{+} + e \cdot w^{-} \\ \text{s.t.}\ & Au^{+} - Au^{-} \ge e \\ & u^{+} \ge 0,\ u^{-} \ge 0,\ u^{\pm} = (w^{\pm}, \theta^{\pm}) \in \mathbb{R}^{n+1} \end{aligned}$$

This program has m constraints in 2(n + 1) non-negative variables.
Typically m, the total number of patterns, is much larger than n, the
pattern dimensionality. Thus it may be computationally advantageous to
solve the dual

(2.3.9)
$$\begin{aligned} \max\ & e \cdot y \\ \text{s.t.}\ & \begin{pmatrix} -e \\ 0 \end{pmatrix} \le A'y \le \begin{pmatrix} e \\ 0 \end{pmatrix} \\ & y \ge 0, \quad y \in \mathbb{R}^m \end{aligned}$$

which has (2n+1) constraints in m non-negative variables.

Let $\hat{w} \cdot x = \hat{\theta}$ be a separating hyperplane for the design sets
$\mathcal{S}_1$, $\mathcal{S}_2$. In problems such as the template-matching model (1.2.4), it is
desirable for the discriminant to generalize to additional patterns that
differ from those in the design sets by small observation errors and noise
terms. Ibaraki and Maroga define the input tolerance $\delta$ associated
with $\hat{w} \cdot x = \hat{\theta}$ as the upper bound on the $\ell_\infty$ norm of displacement vectors
d such that x + d lies on the same side of the hyperplane as x, where
$x \in \mathcal{S}_1 \cup \mathcal{S}_2$. Thus an observed pattern x* that differs from a design
pattern x by a magnitude less than $\delta$ in each component will be classified
into the same class as x. They show that the separating hyperplane defined
by an optimal solution to (2.3.7) has the maximum input tolerance of all
separating hyperplanes.

The alternative form of the maximum quality problem with the $\ell_2$
norm was first investigated by Rosen [9] in the form of the quadratic
program

(2.3.10)
$$\begin{aligned} \min\ & w \cdot w \\ \text{s.t.}\ & Au \ge e \\ & u = (w, \theta) \in \mathbb{R}^{n+1} \end{aligned}$$

This program has the following geometrical interpretation. Let
$\hat{u} = (\hat{w}, \hat{\theta})$ be any feasible solution to the system $Au \ge e$. The hyperplanes
$\hat{w} \cdot x = \hat{\theta} - 1$, $\hat{w} \cdot x = \hat{\theta} + 1$ are parallel to the separating hyperplane $\hat{w} \cdot x = \hat{\theta}$
and bound a "dead zone" $\{x : |\hat{w} \cdot x - \hat{\theta}| < 1\}$ of width $2/\|\hat{w}\|_2$ as shown
in Figure (2.3.11). If the patterns in $\mathcal{S}_1$ and $\mathcal{S}_2$ suffer displacements,
the hyperplane $\hat{w}x = \hat{\theta}$ will still separate the displaced sets as long as
all displacements are of Euclidean distance less than $1/\|\hat{w}\|_2$, i.e. half
the width of the dead zone. Thus the optimal solution to (2.3.10) defines
the separating hyperplane with dead zone of greatest width and hence highest
tolerance to pattern displacements as measured by Euclidean distance.

Figure (2.3.11). Pattern Sets Separated by an Empty "Dead Zone." (Legend:
one marker denotes class $C_1$ patterns, the other class $C_2$ patterns.)
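
The dead zone geometry can be verified numerically on a small example. The sketch below assumes a hypothetical feasible solution of Au >= e for a two-dimensional data set; the function names and data are illustrative only.

```python
import math


def norm2(w):
    """Euclidean norm of w."""
    return math.sqrt(sum(c * c for c in w))


def dead_zone_width(w):
    """Distance between the hyperplanes w.x = theta - 1 and w.x = theta + 1."""
    return 2.0 / norm2(w)


def distance_to_hyperplane(x, w, theta):
    """Euclidean distance from the pattern x to the hyperplane w.x = theta."""
    return abs(sum(wi * xi for wi, xi in zip(w, x)) - theta) / norm2(w)


# Hypothetical design sets and a solution u = (w, theta) with Au >= e.
s1 = [(2.0, 2.0), (3.0, 1.0)]
s2 = [(0.0, 0.0), (1.0, -1.0)]
w, theta = (0.5, 0.5), 1.0

width = dead_zone_width(w)
nearest = min(distance_to_hyperplane(x, w, theta) for x in s1 + s2)
```

Every pattern lies at least half the dead zone width from the separating hyperplane, so any displacement shorter than that cannot change a classification.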

The reliability results of this section are all examples of the
following general principle. Let $u = (w, \theta)$ define a separating hyperplane
$w \cdot x = \theta$ of quality Q(u) for the pattern sets $\mathcal{S}_1$, $\mathcal{S}_2$. Let the pattern
classes $C_1$, $C_2$ be defined for a given scalar value $\alpha > 0$ by

$$C_i = \{x + d : x \in \mathcal{S}_i,\ \|d\|_q \le \alpha\}$$

Let p, q be real numbers such that $1 \le p \le \infty$, $1 \le q \le \infty$, and
1/p + 1/q = 1.

PROPOSITION (2.3.12). Let $w \cdot x = \theta$ separate $\mathcal{S}_1$, $\mathcal{S}_2$.
If $\alpha < Q(w, \theta)/\|w\|_p$, then the hyperplane $wx = \theta$ separates
$C_1$ and $C_2$.

Proof. Let $f(x) = wx - \theta$. For a given displacement d and pattern
$x \in \mathcal{S}_1 \cup \mathcal{S}_2$, x and x + d will have the same classification if
$|f(x + d) - f(x)| < |f(x)|$. But $Q(u) = \min_{x \in \mathcal{S}_1 \cup \mathcal{S}_2} |f(x)|$, so it is
sufficient to show $|f(x + d) - f(x)| < Q(u)$ for all displacements d
such that $\|d\|_q \le \alpha$. But

$$|f(x + d) - f(x)| = |wd|$$

and the result follows from the hypothesized upper bound on $\alpha$ and the
Hölder inequality

$$|w \cdot d| \le \|w\|_p \cdot \|d\|_q. \quad \square$$
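
The tolerance bound of the proposition can be evaluated for the three norm pairs discussed in this section. The sketch below is illustrative: the function names, the data, and the single test displacement are assumptions, and checking one displacement does not prove the bound.

```python
import math


def lp_norm(v, p):
    """The l_p norm of v for 1 <= p <= infinity."""
    if p == math.inf:
        return max(abs(c) for c in v)
    return sum(abs(c) ** p for c in v) ** (1.0 / p)


def f(w, theta, x):
    """Linear discriminant f(x) = w.x - theta."""
    return sum(wi * xi for wi, xi in zip(w, x)) - theta


def tolerance_bound(patterns, w, theta, p):
    """Q(u) / ||w||_p, with Q(u) the smallest |f(x)| over the design patterns."""
    q_u = min(abs(f(w, theta, x)) for x in patterns)
    return q_u / lp_norm(w, p)


# Hypothetical design patterns separated by x1 + x2 = 2, with Q(u) = 2.
patterns = [(2.0, 2.0), (3.0, 1.0), (0.0, 0.0), (1.0, -1.0)]
w, theta = (1.0, 1.0), 2.0

bound_inf = tolerance_bound(patterns, w, theta, 1)          # p = 1, q = infinity
bound_two = tolerance_bound(patterns, w, theta, 2)          # p = 2, q = 2
bound_one = tolerance_bound(patterns, w, theta, math.inf)   # p = infinity, q = 1

# A displacement whose max-norm is below bound_inf leaves every sign unchanged.
d = (0.9, 0.9)
same_side = all(
    f(w, theta, x) * f(w, theta, [xi + di for xi, di in zip(x, d)]) > 0
    for x in patterns
)
```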

For any w there exists a displacement d such that $|wd| = \|w\|_p \cdot \|d\|_q$
(Luenberger [10], p. 30). Thus the bound $Q(u)/\|w\|_p$ on the
$\ell_q$ norm of allowable displacements is sharp, and the solution of the maximum
quality problem (2.3.2) or (2.3.4) defines a separating hyperplane that
maximizes the $\ell_q$ norm of allowable displacements. The Perceptron results
($p = \infty$, q = 1), the input tolerance in the Ibaraki and Maroga model
(p = 1, $q = \infty$), and the dead zone width in the Rosen model (p = 2, q = 2)
are specializations of the following corollary to Proposition (2.3.12).

COROLLARY (2.3.13). Let $\hat{w} \cdot x = \hat{\theta}$ be a separating hyperplane defined by
an optimal solution $\hat{u} = (\hat{w}, \hat{\theta})$ to the maximum quality problem

max  $\lambda$
s.t.  $Au - \lambda e \ge 0$
      $\|w\|_p \le 1$
      $u = (w, \theta) \in \mathbb{R}^{n+1}$

Then $\hat{f}(x) = \hat{w} \cdot x - \hat{\theta}$ solves the template matching problem (1.2.4) for the
$\ell_q$ norm, i.e.

max  $\alpha$
s.t.  $f(x;p) > 0 \quad \forall x \in \mathcal{C}_1 = \{x + d : x \in \mathcal{A}_1,\ d \in \mathcal{D}\}$
      $f(x;p) < 0 \quad \forall x \in \mathcal{C}_2 = \{x + d : x \in \mathcal{A}_2,\ d \in \mathcal{D}\}$

where $f(x;p) = f(x; w, \theta) = w \cdot x - \theta$ and $\mathcal{D} = \{d : \|d\|_q \le \alpha\}$.

2.4.  Extensions to the Inseparable Case

If the definition (2.3.1) of the quality Q(u) of the hyperplane
defined by $u = (w, \theta)$ is extended to non-separating hyperplanes, the
quality of such hyperplanes is non-positive. In a linearly inseparable
problem, a maximum quality hyperplane may provide a useful discriminant
if the region of overlap between the convex hulls of $\mathcal{A}_1$ and $\mathcal{A}_2$ is
relatively small. However, the maximum quality program (2.3.2) is no
longer applicable since it produces the useless optimal solution u = 0
in the inseparable case. This solution can be eliminated by bounding
$\|w\|$ away from zero in the program

(2.4.1)   max  $\lambda$
          s.t.  $Au - \lambda e \ge 0$
                $\|w\| \ge 1$
                $u = (w, \theta) \in \mathbb{R}^{n+1}$

which is obtained from (2.3.2) by reversing the inequality defining the
bound on $\|w\|$.

Let $u^* = (w^*, \theta^*)$ be an optimal solution to (2.4.1) with optimal
objective value $\lambda^*$. There are three cases.

Case 1. $\lambda^* = +\infty$. This occurs when $\mathcal{A}_1$ and $\mathcal{A}_2$ are linearly separable.
In this case there exists a solution $\hat{u} = (\hat{w}, \hat{\theta})$ with quality $Q(\hat{u}) > 0$
to the system

$Au > 0$
$\|w\| = 1$
$u = (w, \theta) \in \mathbb{R}^{n+1}$

Then for all $\alpha \ge 1$, $u = \alpha \hat{u}$ is feasible for (2.4.1) with corresponding
objective value $\lambda = \alpha Q(\hat{u})$.

Case 2. $\lambda^* = 0$. This is a special case in which strict linear separability
is impossible but there exists a $u^* = (w^*, \theta^*)$ such that

$w^* \cdot x \ge \theta^* \quad \forall x \in \mathcal{A}_1$
$w^* \cdot x \le \theta^* \quad \forall x \in \mathcal{A}_2$

with at least one inequality being tight for a pattern from each
sample set. Thus the convex hulls of $\mathcal{A}_1$ and $\mathcal{A}_2$ intersect
only in a subset of the hyperplane $w^* \cdot x = \theta^*$.

Case 3. $\lambda^* < 0$. This is the linearly inseparable case of interest.
The constraint $\|w\| \ge 1$ is tight; otherwise $u = u^*/\|w^*\|$ is
a better solution. Similarly, at least one of the inequalities
$Au^* - \lambda^* e \ge 0$ is tight; otherwise $\lambda = \lambda^*$ may be increased
while maintaining $u = u^*$. Hence $\lambda^* = Q(u^*)$ and (2.4.1)
defines a maximum quality hyperplane.


These cases are summarized in the following analog to Proposition (2.3.3).

PROPOSITION (2.4.2). Let $\mathcal{A}_1$, $\mathcal{A}_2$ be finite pattern sets. Then $\mathcal{A}_1$, $\mathcal{A}_2$
are linearly inseparable iff (2.4.1) has a bounded optimal objective
value $\lambda^* \le 0$ corresponding to an optimal solution $u^* = (w^*, \theta^*)$. If
$\lambda^* < 0$, then $\|w^*\| = 1$ and $\lambda^* = Q(u^*)$.

If $\lambda^* < 0$, (2.4.1) has the alternative form

(2.4.3)   max  $\|w\|$
          s.t.  $Au \ge -e$
                $u = (w, \theta) \in \mathbb{R}^{n+1}$

If $u^* = (w^*, \theta^*)$ solves (2.4.3), then $u^*/\|w^*\|$ solves (2.4.1) with
max $\lambda = -1/\|w^*\|$. Conversely, if $u^* = (w^*, \theta^*)$ solves (2.4.1) with
$\lambda^* < 0$, then $u^*/(-\lambda^*)$ solves (2.4.3) with max $\|w\| = -1/\lambda^*$.

A geometric interpretation of the maximum quality hyperplane
produced by (2.4.3) in the linearly inseparable case with the $\ell_2$ norm
is shown in Figure (2.4.4). The inequality system $Au \ge -e$ is equivalent to

$w \cdot x - \theta \ge -1 \quad \forall x \in \mathcal{A}_1$
$w \cdot x - \theta \le +1 \quad \forall x \in \mathcal{A}_2$

Figure (2.4.4). Linearly Inseparable Pattern Sets.

Thus all patterns in $\mathcal{A}_1$ lie in the non-negative half-space of the
hyperplane $w \cdot x = \theta - 1$ and all patterns in $\mathcal{A}_2$ lie in the non-positive
half-space of the parallel hyperplane $w \cdot x = \theta + 1$. Thus the pattern
sets overlap in the zone

$D = \{x : |w \cdot x - \theta| \le 1\}$ ,

while the patterns outside this zone all are classified correctly. The
zone has width $2/\|w\|_2$, so the optimal solution to (2.4.3) is defined to
be the one whose overlap zone is of minimum width.

The maximum quality hyperplane in the linearly inseparable case
has several drawbacks. First, it is quite difficult to solve the
programs (2.4.1) and (2.4.3) in general. In (2.4.1) the constraint set is
non-convex, while (2.4.3) requires the maximization of a convex function,
so a Kuhn-Tucker point is not necessarily a global optimum. Second, the
maximum quality hyperplane may be a very poor choice if there is significant
overlap between the convex hulls of $\mathcal{A}_1$ and $\mathcal{A}_2$. The problem is
illustrated by the following example, which shows that the maximum quality
hyperplane places too much emphasis on the outlying or "maverick" patterns
which are least representative of their own classes.


EXAMPLE (2.4.6).

Let $\mathcal{A}_1 = \{1, 2, \ldots, k, -(k+1)\}$, $\mathcal{A}_2 = \{-1, -2, \ldots, -k, (k+1)\}$
be sets of one-dimensional patterns. For $k \ge 2$, the linear discriminant
with the lowest error rate is given by $f(x) = x - \theta$ for any
$\theta \in (-1, 1)$. Such a discriminant misclassifies only the two outliers,
namely, $-(k+1)$ in $\mathcal{A}_1$ and $(k+1)$ in $\mathcal{A}_2$. However, the maximum
quality hyperplane produced by (2.4.1) is quite different. Tightness of
the constraint $\|w\| \ge 1$ at optimality implies that w = +1 or w = -1
for any $\ell_p$ norm. It is easily verified that the optimal solution to
(2.4.1) is $(w, \theta) = (-1, 0)$ with $\lambda^* = -k$, the bound being set by the
patterns $k$ in $\mathcal{A}_1$ and $-k$ in $\mathcal{A}_2$. This corresponds to the
discriminant f(x) = -x, which misclassifies all patterns in both sets except
the two outliers. For large values of k this ranks among the worst
choices of possible discriminants. □
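
The error counts quoted in the example can be checked directly. The sketch below is illustrative only: the function names and the choice k = 5 are assumptions, and a pattern lying exactly on the hyperplane is counted as an error.

```python
def example_sets(k):
    """Pattern sets of Example (2.4.6) for a given k."""
    set1 = list(range(1, k + 1)) + [-(k + 1)]
    set2 = [-x for x in set1]
    return set1, set2

def misclassified(w, theta, set1, set2):
    """Count patterns on the wrong side of f(x) = w*x - theta.

    Patterns in set1 should give f(x) > 0, patterns in set2 should give f(x) < 0.
    """
    errors1 = sum(1 for x in set1 if w * x - theta <= 0)
    errors2 = sum(1 for x in set2 if w * x - theta >= 0)
    return errors1 + errors2

k = 5  # hypothetical value; any k >= 2 shows the same pattern
set1, set2 = example_sets(k)
errors_low = misclassified(1, 0, set1, set2)    # f(x) = x
errors_maxq = misclassified(-1, 0, set1, set2)  # f(x) = -x
```

With these definitions f(x) = x errs only on the two outliers, while f(x) = -x errs on the remaining 2k patterns.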


The difficulty of computation and possible poor performance of
the maximum quality hyperplane for linearly inseparable problems suggests
the need for alternative procedures. Such procedures are the subject of
Chapter 4. In particular, the linear program

(2.4.7)   min  $e \cdot s$
          s.t.  $Au + Is \ge e$
                $s \ge 0$
                $u \in \mathbb{R}^{n+1}$, $s \in \mathbb{R}^{m}$

is discussed. This program determines a separating hyperplane if one
exists, but the solution does not necessarily have any of the desirable
properties of a maximum quality discriminant. However, in the linearly
inseparable case, (2.4.7) is much easier to solve than a maximum quality
problem and places less emphasis on outlying patterns. For the example
cited above, it is shown that the optimal solution to (2.4.7) yields the
discriminant f(x) = x, which is in the set of lowest error rate
discriminants.

CHAPTER 3


THE  LEAST  POSITIVE  DEVIATIONS  PROBLEM 


3.1.  Linear  Inequalities 

This chapter deals with the general linear inequality system

(3.1.1)   $Ax \ge b$
          $x \in \mathbb{R}^n$

where A is an (m x n) matrix with m > n and $b \in \mathbb{R}^m$. The matrix A
is assumed to be of full column rank n. Let $a_i$ denote the ith row of
A and $\beta_i$ the ith component of b. Thus the ith inequality is

$a_i \cdot x \ge \beta_i$.

A solution to (3.1.1), if one exists, can be found as the optimal
solution to the Phase I linear program

(3.1.2)   min  $e \cdot s$
          s.t.  $Ax + Is \ge b$
                $s \ge 0$
                $x \in \mathbb{R}^n$, $s \in \mathbb{R}^m$

Problem (3.1.2) will be called the least positive deviations (LPD)
problem corresponding to the tableau [A:b]. If $(\hat{x}, \hat{s})$ is an optimal
solution to (3.1.2), then $\hat{x}$ will be called an LPD solution to the
inequality system (3.1.1). An LPD solution always exists since the LPD
linear program is feasible ($x = 0$, $s = b^+$ is a feasible solution) and
the objective function is bounded below by zero on the constraint set.
Clearly system (3.1.1) has a feasible solution iff the LPD solution is
a feasible solution. In this case the optimal LPD objective value is
equal to zero.

The least positive deviations terminology arises from the
equivalence between the LPD linear program and the unconstrained
minimization problem

(3.1.3)   $\min_{x \in \mathbb{R}^n} f(x) = \sum_{i=1}^{m} (\beta_i - a_i \cdot x)^+ = e \cdot (b - Ax)^+$

where $(\beta_i - a_i \cdot x)^+ = \max(0, \beta_i - a_i \cdot x)$. If $(\hat{x}, \hat{s})$ is optimal for
(3.1.2), then it is easy to show that $\hat{s} = (b - A\hat{x})^+$. Furthermore, for
any $x \in \mathbb{R}^n$, $(x, (b - Ax)^+)$ is feasible for (3.1.2). Together these
statements imply that $(\hat{x}, \hat{s})$ is optimal for (3.1.2) iff $\hat{x}$ is optimal
for (3.1.3), where $\hat{s} = (b - A\hat{x})^+$.
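
The correspondence between the slack vector and the objective can be made concrete with a few lines of code. This is a minimal sketch under assumed names (`deviations`, `lpd_objective`); the 3 x 2 system is a made-up infeasible example, not one taken from the text.

```python
def deviations(A, b, x):
    """Return s = (b - Ax)^+ componentwise, the smallest feasible slack for (3.1.2)."""
    return [max(0, bi - sum(aij * xj for aij, xj in zip(row, x)))
            for row, bi in zip(A, b)]

def lpd_objective(A, b, x):
    """f(x) = e . (b - Ax)^+, the LPD objective (3.1.3)."""
    return sum(deviations(A, b, x))

# Hypothetical infeasible system: x1 >= 1, x2 >= 1, x1 + x2 <= 1.
A = [[1, 0], [0, 1], [-1, -1]]
b = [1, 1, -1]

s_at_11 = deviations(A, b, [1, 1])     # only the third inequality is violated
f_at_00 = lpd_objective(A, b, [0, 0])  # the first two inequalities are violated
```

For any x, the pair (x, `deviations(A, b, x)`) satisfies $Ax + Is \ge b$ and $s \ge 0$, which is the feasibility statement used above.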

The LPD linear program (3.1.2) can be written in the standard form

(3.1.4)   min  $e \cdot s_1$
          s.t.  $Ax_1 - Ax_2 + Is_1 - Is_2 = b$
                $x_1, x_2, s_1, s_2 \ge 0$
                $x_1 \in \mathbb{R}^n$, $x_2 \in \mathbb{R}^n$, $s_1 \in \mathbb{R}^m$, $s_2 \in \mathbb{R}^m$

This primal formulation has m constraints in 2(m + n) non-negative
variables. The dual of (3.1.4) is

(3.1.5)   max  $b \cdot y$
          s.t.  $A'y = 0$
                $0 \le y \le e$
                $y \in \mathbb{R}^m$

which has n constraints in m upper bounded non-negative variables. If
the number of inequalities m in (3.1.1) is large relative to n, then
the simplex method with upper bounds (Dantzig [11]) applied to the dual
(3.1.5) would be computationally more convenient and probably more
efficient than the standard simplex method applied to the primal (3.1.4).

When applied to (3.1.5), the simplex method with upper bounds
terminates in a basic optimal solution $\hat{y}$ which has n basic variables
$(\hat{y})_{i_1}, \ldots, (\hat{y})_{i_n}$ and (m-n) non-basic variables. Each non-basic
variable is either equal to its lower bound of zero or its upper bound
of one, while the basic variables are equal to values lying in the interval
[0,1]. The optimal basis defined by $\hat{y}$ consists of n column vectors
$a_{i_1}', \ldots, a_{i_n}'$ from A'. The simplex multiplier vector $\hat{x}$ corresponding
to the optimal basis is the solution to the (n x n) linear equality system

(3.1.6)   $a_{i_j} \cdot x = \beta_{i_j}$ ,  $j = 1, \ldots, n$

From the duality relationship between (3.1.4) and (3.1.5) it follows that
$\hat{x}$ defines the optimal solution

(3.1.7)   $x_1 = \hat{x}^+$
          $x_2 = \hat{x}^-$
          $s_1 = (b - A\hat{x})^+$
          $s_2 = (b - A\hat{x})^-$

to the primal (3.1.4) and hence $\hat{x}$ is an LPD solution to (3.1.1). The
termination condition of the simplex method with upper bounds requires
that the columns of A' price out as follows

(3.1.8)   $a_i \cdot \hat{x} = \beta_i$   if $(\hat{y})_i$ is basic
          $a_i \cdot \hat{x} \ge \beta_i$   if $(\hat{y})_i = 0$ and is non-basic
          $a_i \cdot \hat{x} \le \beta_i$   if $(\hat{y})_i = 1$ and is non-basic

If the primal solution (3.1.7) is non-degenerate, there are no basic slack
variables equal to zero and the inequalities in (3.1.8) are strict. Thus
if a non-basic optimal dual variable is at its lower bound the corresponding
inequality in (3.1.1) is satisfied at $x = \hat{x}$, while if it is at its upper
bound the inequality is violated (assuming a non-degenerate primal solution).
If the dual variable is basic, the inequality is tight.

Thus the search for an LPD solution to (3.1.1) can be confined to
simplex multiplier vectors associated with bases for the dual problem
(3.1.5). The following terminology will be used to describe these vectors.
A point $\hat{x} \in \mathbb{R}^n$ is defined to be a basic inequality solution to the
inequality system $Ax \ge b$ if at least n of the m inequalities are
tight at $x = \hat{x}$ and the corresponding row vectors $a_{i_1}, \ldots, a_{i_n}$ are
linearly independent. If exactly n inequalities are tight, $\hat{x}$ is
non-degenerate. For any linearly independent set of n row vectors
$a_{i_1}, \ldots, a_{i_n}$, there is exactly one basic inequality solution $\hat{x}$ which
can be found by solving the linear equality system (3.1.6). Each basic
inequality solution $\hat{x}$ to (3.1.1) defines the corresponding basic feasible
solution to the primal linear program (3.1.4) given by

(3.1.9)   $(x_1, x_2, s_1, s_2) = (\hat{x}^+, \hat{x}^-, (b - A\hat{x})^+, (b - A\hat{x})^-)$

Conversely, however, a basic feasible solution to (3.1.4) does not
necessarily define a basic inequality solution to (3.1.1). For example,
the basic feasible solution to (3.1.4)

$(x_1, x_2, s_1, s_2) = (0, 0, b^+, b^-)$

corresponds to the non-basic inequality solution x = 0 to (3.1.1).

Two basic inequality solutions $x_k$, $x_l$ are defined to be
adjacent if the corresponding dual bases have exactly (n-1) column vectors
of A' in common. Thus the simplex multiplier vectors $x_k$, $x_{k+1}$ at
successive iterations of the simplex method with upper bounds applied to
the dual are adjacent basic inequality solutions to (3.1.1). Again,
however, the basic feasible solutions defined by (3.1.9) for two adjacent
basic inequality solutions $x_k$, $x_l$ to (3.1.1) are not necessarily
adjacent basic feasible solutions to the primal problem (3.1.4). The
reason is that the two sets of (m-n) basic slack variables can be
completely different. The algorithm presented below for determining LPD
solutions gains considerable computational efficiency by moving only
between adjacent basic inequality solutions to (3.1.1), thus avoiding
pivoting operations at intermediate basic feasible solutions to (3.1.4)
where only slack variables are entering and leaving the basis.

If the simplex method with upper bounds is applied to the dual,
the dual objective function $b \cdot y$ increases at each step (assuming
non-degeneracy). However, it is not true in general that $f(x_{k+1}) < f(x_k)$
for the corresponding simplex multiplier vectors $x_k$, where
$f(x) = e \cdot (b - Ax)^+$ is the LPD objective function. Thus intermediate
multiplier vectors may be quite far from being optimal for the primal, and
hence the dual problem must be iterated to completion to obtain a good (in
this case, optimal) basic solution to the primal. From numerical experience
on LPD problems of pattern recognition and control theory origin, it has
been observed that the structure of the constraint set can be very
complicated even in relatively small problems (e.g. m < 1000, n < 11) with
consequent slow convergence of the simplex method with upper bounds.

In  the  next  sections  an  algorithm  for  the  LPD  problem  is  presented 
that  has  proved  to  be  very  efficient  on  many  numerical  test  problems, 
particularly on those in which the system $Ax \ge b$ is infeasible. The
algorithm produces a finite sequence $\{x_k\}$ of basic inequality solutions
to (3.1.1) that terminates in an LPD solution. Members of the sequence
are  shown  to  be  obtainable  as  the  simplex  multiplier  vectors  corresponding 
to  the  path  of  bases  produced  by  a modification  to  the  usual  pivot  selection 
rules  in  the  simplex  method  with  upper  bounds  applied  to  the  dual. 

Assuming  non-degeneracy,  the  new  pivot  selection  rules  produce  a decrease 
in  the  primal  objective  rather  than  an  increase  in  the  dual  objective 
at  each  basis  change  (the  upper  and  lower  bounds  on  the  dual  variables  do 
not enter into the calculation and are thus ignored).

3.2.  The One-Dimensional LPD Problem


For the case n = 1, the inequality system (3.1.1) has the form

(3.2.1)   $(\alpha_1, \alpha_2, \ldots, \alpha_m)' \, x \ge (\beta_1, \beta_2, \ldots, \beta_m)'$

where x is a scalar variable. It is easily seen that system (3.2.1) is
feasible iff

$\tau_1 = \max_{\alpha_i > 0} \dfrac{\beta_i}{\alpha_i} \le \min_{\alpha_i < 0} \dfrac{\beta_i}{\alpha_i} = \tau_2$

and any x in the interval $[\tau_1, \tau_2]$ is a feasible solution ($\tau_1 = -\infty$
if all $\alpha_i$ are non-positive; similarly $\tau_2 = +\infty$ if all $\alpha_i$ are
non-negative). However, if the system (3.2.1) is infeasible, a more general
approach is necessary to find an LPD solution. In place of the linear
programming approach presented in the last section, a more direct solution
technique for the LPD problem is discussed below. This method treats the
problem in the unconstrained form

(3.2.2)   $\min_{x \in \mathbb{R}^1} f(x) = \sum_{i=1}^{m} (\beta_i - \alpha_i x)^+$

Without loss of generality, it is assumed that $\alpha_i \ne 0$, $i = 1, \ldots, m$.

Let $f(x) = \sum_{i=1}^{m} f_i(x)$, where $f_i(x) = (\beta_i - \alpha_i x)^+$. A typical
$f_i(x)$ for $\alpha_i > 0$ and $\alpha_i < 0$ is graphed in Figure (3.2.3). In either
case the graph consists of two linear segments with a breakpoint at
$x = \beta_i/\alpha_i$. At the breakpoint the slope increases by $|\alpha_i|$. Figure (3.2.4)
illustrates the graph of the sum of six such functions.

Figure (3.2.4). Sum of Six Positive Deviation Functions, with
Minimum at x - y

In general, assuming the points $\beta_i/\alpha_i$, $i = 1, \ldots, m$ are distinct, the graph of f(x)
consists of (m + 1) linear segments with breakpoints at each $\beta_i/\alpha_i$.
The right-hand derivative at each breakpoint $\beta_i/\alpha_i$ exceeds the left-hand
derivative by $|\alpha_i|$. The extreme left-hand infinite segment has slope
equal to $-\sum_{i=1}^{m} \alpha_i^+$ and the extreme right-hand infinite segment has
slope equal to $\sum_{i=1}^{m} \alpha_i^-$. If there is no flat (zero slope) segment, f(x)
has a unique minimum at the breakpoint where the right-hand derivative first
becomes positive. If there is a flat segment, all points along this
segment are minima.
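
The feasibility test for the scalar system can be written out in a few lines. The sketch below uses hypothetical names (`feasible_interval`, `is_feasible`) and made-up data; it assumes every $\alpha_i$ is nonzero, as in the text.

```python
import math

def feasible_interval(alpha, beta):
    """Return (tau1, tau2) for the scalar system alpha_i * x >= beta_i, alpha_i != 0."""
    tau1 = max((b / a for a, b in zip(alpha, beta) if a > 0), default=-math.inf)
    tau2 = min((b / a for a, b in zip(alpha, beta) if a < 0), default=math.inf)
    return tau1, tau2

def is_feasible(alpha, beta):
    """The system is feasible iff tau1 <= tau2."""
    tau1, tau2 = feasible_interval(alpha, beta)
    return tau1 <= tau2

# Hypothetical data: x >= 1, x >= 2, x <= 5 is feasible; x >= 3, x <= 1 is not.
ok_interval = feasible_interval([1, 2, -1], [1, 4, -5])
bad_interval = feasible_interval([1, -1], [3, -1])
```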

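From the observations above it follows that the right and left
hand derivatives at any point $x = \hat{x}$ are given by the formulas

(3.2.5)   $\left.\dfrac{df(x)}{dx^+}\right|_{x=\hat{x}} = -\sum_{i=1}^{m} \alpha_i^+ + \sum_{\{i:\ \beta_i/\alpha_i \le \hat{x}\}} |\alpha_i|$

          $\left.\dfrac{df(x)}{dx^-}\right|_{x=\hat{x}} = -\sum_{i=1}^{m} \alpha_i^+ + \sum_{\{i:\ \beta_i/\alpha_i < \hat{x}\}} |\alpha_i|$

These formulas are the basis of the following solution procedure for the
one-dimensional LPD problem (3.2.2).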

PROCEDURE (3.2.6). One-Dimensional LPD Solution Procedure

1.  Sort the m breakpoints $\beta_i/\alpha_i$ into ascending order. If there are
repeated instances of any breakpoint, all such instances must be
included in the ordered list. Reindex and let the index i now
refer to the new order.

2.  Let

$\gamma_j = -\sum_{i=1}^{m} \alpha_i^+ + \sum_{i=1}^{j} |\alpha_i|$ ,  $j = 1, \ldots, m$

and let j* be the smallest value of j for which $\gamma_j \ge 0$. Then
the breakpoint $x^* = \beta_{j^*}/\alpha_{j^*}$ is optimal for (3.2.2).

Proof. By (3.2.5),

$\gamma_j \le \left.\dfrac{df(x)}{dx^+}\right|_{x=\beta_j/\alpha_j}$

with equality if only one inequality is tight at $x = \beta_j/\alpha_j$. Thus by
definition of $\gamma_j$ and j*,

(3.2.7)   $\left.\dfrac{df(x)}{dx^-}\right|_{x=x^*} \le 0 \le \left.\dfrac{df(x)}{dx^+}\right|_{x=x^*}$

But (3.2.7) are precisely the necessary and sufficient conditions for a
convex piecewise linear function of a scalar variable x to be minimized
by x = x*. □


The procedure is implemented by ordering the breakpoints and successively
adding the slope changes $|\alpha_i|$ to the initial left-hand derivative. The
procedure stops at an optimal breakpoint when this sum, and hence the
right-hand derivative, first becomes non-negative. If the minimum is
not unique, the solution that is produced is thus the smallest minimizing
breakpoint.
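
Procedure (3.2.6) translates almost line for line into code. The following is a minimal sketch rather than the author's implementation: the function names are hypothetical, exact rational arithmetic from the standard library is used so that breakpoint comparisons are exact, and every $\alpha_i$ is assumed nonzero.

```python
from fractions import Fraction

def lpd_1d(alpha, beta):
    """Smallest minimizing breakpoint of f(x) = sum_i (beta_i - alpha_i * x)^+."""
    # Step 1: sort the breakpoints beta_i/alpha_i ascending, keeping repeats.
    pairs = sorted(zip(alpha, beta), key=lambda ab: Fraction(ab[1]) / Fraction(ab[0]))
    # Slope of the extreme left-hand segment: minus the sum of the positive alpha_i.
    slope = -sum(a for a in alpha if a > 0)
    # Step 2: add the slope changes |alpha_i| until the running sum is non-negative.
    for a, b in pairs:
        slope += abs(a)
        if slope >= 0:
            return Fraction(b) / Fraction(a)

def f_1d(alpha, beta, x):
    """One-dimensional LPD objective (3.2.2)."""
    return sum(max(0, b - a * x) for a, b in zip(alpha, beta))

# Hypothetical infeasible system: x >= 3, x >= 5, x <= 1, x <= 2.
alpha = [1, 1, -1, -1]
beta = [3, 5, -1, -2]
x_star = lpd_1d(alpha, beta)
```

In this example f is flat between the breakpoints 2 and 3, and the procedure returns the smaller of the two, as described above.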

As an alternative to (3.2.5), the derivatives at any point $\hat{x}$
can be calculated from the formulas

(3.2.8)   $\left.\dfrac{df(x)}{dx^+}\right|_{x=\hat{x}} = \sum_{\{i:\ \alpha_i \hat{x} < \beta_i\}} (-\alpha_i) + \sum_{\{i:\ \alpha_i \hat{x} = \beta_i\}} \alpha_i^-$

          $\left.\dfrac{df(x)}{dx^-}\right|_{x=\hat{x}} = \sum_{\{i:\ \alpha_i \hat{x} < \beta_i\}} (-\alpha_i) - \sum_{\{i:\ \alpha_i \hat{x} = \beta_i\}} \alpha_i^+$

Rather than starting from the numerically smallest breakpoint as in
Procedure (3.2.6), the search procedure can be initiated from an arbitrary
breakpoint through the use of these formulas.

PROCEDURE (3.2.9). Modified One-Dimensional LPD Solution Procedure

1.  Select an arbitrary breakpoint $\hat{x} = \beta_i/\alpha_i$ and calculate the left
and right-hand derivatives at $\hat{x}$ from (3.2.8). If $\hat{x}$ is optimal
by (3.2.7), stop. Otherwise go to step 2.

2.  If $(df/dx^+)|_{x=\hat{x}} < 0$, sort the breakpoints that are strictly greater
than $\hat{x}$ into ascending order and let the index i refer to this order.
If $(df/dx^-)|_{x=\hat{x}} > 0$, sort the breakpoints that are strictly less than
$\hat{x}$ into descending order indexed by i. In either case include all
instances of repeated breakpoints.

Define

$\gamma_j = \left.\dfrac{df}{dx^+}\right|_{x=\hat{x}} + \sum_{i=1}^{j} |\alpha_i|$   in the ascending case,

$\gamma_j = -\left.\dfrac{df}{dx^-}\right|_{x=\hat{x}} + \sum_{i=1}^{j} |\alpha_i|$   in the descending case.

Let j* be the smallest value of j such that

$\gamma_{j^*} \ge 0$ and $\gamma_{j^*} - |\alpha_{j^*}| \le 0$

Then $x^* = \beta_{j^*}/\alpha_{j^*}$ is optimal for (3.2.2).

Proof. From (3.2.5) and (3.2.8), it follows in the ascending case that

$\gamma_j \le \left.\dfrac{df}{dx^+}\right|_{x=\beta_j/\alpha_j}$ ,   $\gamma_j - |\alpha_j| \ge \left.\dfrac{df}{dx^-}\right|_{x=\beta_j/\alpha_j}$

with equality in each case if only one inequality is tight at $x = \beta_j/\alpha_j$.
The descending case is symmetric, with the roles of the two one-sided
derivatives interchanged and their signs reversed. Thus by definition of j*,
the optimality criterion (3.2.7) is satisfied at x = x*. □

The procedure first determines on which side of the initial breakpoint
$\hat{x}$ the minimum lies based on the algebraic sign of the right-hand
derivative at $\hat{x}$. Successive breakpoints are then examined and the
derivative updated until an optimum breakpoint satisfying (3.2.7) is
found. Procedure (3.2.9) will be incorporated into an algorithm for
solving the general n-dimensional LPD problem. It is used to solve
one-dimensional problems of the form

$\min_{\tau \in \mathbb{R}^1} f(\hat{x} + \tau d)$

where $\hat{x} \in \mathbb{R}^n$ is a basic inequality solution to (3.1.1) and $d \in \mathbb{R}^n$
is a search direction.
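
A search started from an arbitrary breakpoint can be sketched as follows. This is an illustrative reading of Procedure (3.2.9), not a transcription of the original code: the names are hypothetical, the one-sided derivatives come from (3.2.8), the running sum starts from the one-sided derivative in the direction of descent, and exact rationals are assumed so that tightness tests are exact.

```python
from fractions import Fraction

def one_sided_derivatives(alpha, beta, x):
    """Right- and left-hand derivatives of f at x, following formulas (3.2.8)."""
    active = sum(-a for a, b in zip(alpha, beta) if a * x < b)
    tight = [a for a, b in zip(alpha, beta) if a * x == b]
    right = active + sum(-a for a in tight if a < 0)  # adds the negative parts
    left = active - sum(a for a in tight if a > 0)    # subtracts the positive parts
    return right, left

def lpd_1d_from(alpha, beta, start):
    """Minimizing breakpoint of sum_i (beta_i - alpha_i*x)^+ found from `start`."""
    right, left = one_sided_derivatives(alpha, beta, start)
    if left <= 0 <= right:
        return start                       # criterion (3.2.7) already holds
    points = [(Fraction(b) / Fraction(a), abs(a)) for a, b in zip(alpha, beta)]
    if right < 0:                          # minimum lies to the right
        gamma = right
        candidates = sorted(p for p in points if p[0] > start)
    else:                                  # left > 0: minimum lies to the left
        gamma = -left
        candidates = sorted((p for p in points if p[0] < start), reverse=True)
    for x, step in candidates:
        gamma += step                      # slope change at this breakpoint
        if gamma >= 0:
            return x

# Hypothetical system: x >= 3, x >= 5, x <= 1, x <= 2 (minimum value 5 on [2, 3]).
alpha = [1, 1, -1, -1]
beta = [3, 5, -1, -2]
from_right = lpd_1d_from(alpha, beta, 5)
from_left = lpd_1d_from(alpha, beta, 1)
```

Starting to the right of the flat segment returns its right end, and starting to the left returns its left end; both are minimizers.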

Let $\hat{x} \in \mathbb{R}^n$ be a basic inequality solution to $Ax \ge b$ with
$a_{i_1}, a_{i_2}, \ldots, a_{i_n}$ the linearly independent row vectors corresponding
to n tight inequalities at $x = \hat{x}$. Define the (n x n) matrix

(3.2.10)   $A_1 = \begin{pmatrix} a_{i_1} \\ \vdots \\ a_{i_n} \end{pmatrix}$

Then the system $Ax \ge b$ can be rearranged and partitioned as

(3.2.11)   $A_1 x \ge b_1$
           $A_2 x \ge b_2$

with $\hat{x} = A_1^{-1} b_1$. The following equations define a search direction
$d^k \in \mathbb{R}^n$ such that all but the kth inequalities in $A_1 x \ge b_1$ remain
tight for $x = \hat{x} + \tau d^k$, $\tau \ne 0$:

(3.2.12)   $a_{i_j} \cdot d^k = 0$ ,  $j \ne k$
           $a_{i_k} \cdot d^k = 1$

Thus $d^k$ is the kth column vector of $A_1^{-1}$. Improvement in the LPD
objective function can be attempted by solving the one-dimensional problem

(3.2.13)   $\min_{\tau \in \mathbb{R}^1} f(\hat{x} + \tau d^k)$

This can be rewritten as the one-dimensional LPD problem

(3.2.14)   $\min_{\tau \in \mathbb{R}^1} \left[ e \cdot (\bar{b} - \bar{a}\tau)^+ + (-\tau)^+ \right]$

where

$\bar{a} = A_2 d^k$
$\bar{b} = b_2 - A_2 \hat{x}$ .

Let $\tau^* = (\bar{b})_{i^*}/(\bar{a})_{i^*}$ be a minimizing breakpoint for (3.2.14).

PROPOSITION (3.2.15). If $\tau = 0$ is not optimal for problem (3.2.14),
then $\hat{x} + \tau^* d^k$ and $\hat{x}$ are adjacent basic inequality solutions with
$f(\hat{x} + \tau^* d^k) < f(\hat{x})$.

Proof. From the defining equations (3.2.12) for $d^k$, it follows that the
n inequalities $A_1 x \ge b_1$, which are tight at $\tau = 0$, remain tight at
$\tau = \tau^*$ except for the inequality corresponding to $a_{i_k}$. An additional
inequality $a_{i^*} \cdot x \ge \beta_{i^*}$ in the system $A_2 x \ge b_2$ that was not tight at
$\tau = 0$ becomes tight at $\tau = \tau^*$. The row vector $a_{i^*}$ cannot be linearly
dependent on $(a_{i_1}, \ldots, a_{i_{k-1}}, a_{i_{k+1}}, \ldots, a_{i_n})$; otherwise $a_{i^*} \cdot d^k = 0$,
implying $a_{i^*} \cdot \hat{x} = \beta_{i^*}$ and hence $(\bar{b})_{i^*} = 0$, which contradicts non-optimality
at $\tau = 0$. Thus $\hat{x} + \tau^* d^k$ is a basic inequality solution corresponding
to the dual basis $\{a_{i_1}, \ldots, a_{i_{k-1}}, a_{i^*}, a_{i_{k+1}}, \ldots, a_{i_n}\}$.
Non-optimality at $\tau = 0$ implies $f(\hat{x} + \tau^* d^k) < f(\hat{x})$. □

Proposition  (3.2.15)  immediately  suggests  the  algorithm  for  the 
LPD  problem  that  is  presented  in  the  next  section.  Starting  with  an 
arbitrary  basic  inequality  solution,  the  algorithm  generates  a sequence 
of  improved  adjacent  basic  inequality  solutions  by  solving  one-dimensional 
LPDs  of  the  form  (3.2.14).  Computationally,  the  algorithm  is  shown  to  be 
implementable  by  changing  the  pivot  selection  rules  of  the  simplex  method 
with  upper  bounds  as  applied  to  the  dual.  As  in  the  simplex  method,  the 
algorithm terminates in a finite number of steps with an optimal solution.

3.3.  The ALPD Algorithm


This  section  presents  the  Accelerated  Least  Positive  Deviations 
(ALPD)  algorithm  for  determining  a LPD  solution  to  the  system  Ax  > b. 

It will be assumed that all basic inequality solutions are non-degenerate.
If this is not true, the vector b can be perturbed by

$(b')_i = (b)_i + \varepsilon^i$

and the system $Ax \ge b'$ will be non-degenerate for all sufficiently small
positive values of $\varepsilon$, and the optimal dual basis for the perturbed system
will also be optimal for the original system. The standard lexicographic
schemes (Dantzig [11]) for the simplex method can be used intact.

Let

k = iteration number

$x_k$ = basic solution at iteration k

$A_k$ = (n x n) non-singular submatrix of A such that the
corresponding inequalities are tight at $x_k$

$d_k^i$ = ith column of $A_k^{-1}$

$f(x) = e \cdot (b - Ax)^+$ .

ALGORITHM (3.3.1). ALPD

Step 0.  Set k = 1. Let $A_1 x \ge b_1$ be any set of n inequalities
such that $A_1$ is non-singular. Set $x_1 = A_1^{-1} b_1$. Go to Step 1.

Step 1.  Determine the right and left-hand derivatives

$\gamma_i = \left.\dfrac{df(x_k + \tau d_k^i)}{d\tau^+}\right|_{\tau=0}$ ,   $\delta_i = \left.\dfrac{df(x_k + \tau d_k^i)}{d\tau^-}\right|_{\tau=0}$

for i = 1,...,n.
Let $\Delta_i = \min(\gamma_i, -\delta_i)$ and $\Delta_{i^*} = \min_{i=1,\ldots,n} \{\Delta_i\}$. If
$\Delta_{i^*} \ge 0$, go to Step 3. Otherwise, go to Step 2.

Step 2.  Using (3.2.9) with initial breakpoint $\tau = 0$, solve the
one-dimensional LPD problem $\min_{\tau} f(x_k + \tau d_k^{i^*})$. Let $\tau^*$ be the
minimizing breakpoint. Set $x_{k+1} = x_k + \tau^* d_k^{i^*}$ and form $A_{k+1}$
by replacing the i*th row of $A_k$ with the row of A corresponding
to the breakpoint at $\tau^*$. Increment k by 1 and go to Step 1.

Step 3.  Stop. The final $x_k$ is optimal.
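
The steps above can be sketched in a few dozen lines. This is a minimal illustration and not the FORTRAN code referred to in the text: all names are hypothetical, exact rationals and a dense Gauss-Jordan inverse are used for clarity rather than speed, the new basic solution is recomputed from the updated basis instead of being updated by $\tau^* d_k^{i^*}$, and no anti-cycling safeguard is included, so non-degeneracy is assumed.

```python
from fractions import Fraction

def inverse(M):
    """Gauss-Jordan inverse of a square Fraction matrix (StopIteration if singular)."""
    n = len(M)
    aug = [list(row) + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(M)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        aug[c] = [v / aug[c][c] for v in aug[c]]
        for r in range(n):
            if r != c:
                aug[r] = [vr - aug[r][c] * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def alpd(A, b, basis):
    """Return (x, basis) for an LPD solution; `basis` lists n independent row indices."""
    A = [[Fraction(v) for v in row] for row in A]
    b = [Fraction(v) for v in b]
    basis, n = list(basis), len(basis)
    while True:
        inv = inverse([A[r] for r in basis])
        x = [dot(inv[i], [b[r] for r in basis]) for i in range(n)]
        best = None
        for i in range(n):                               # Step 1
            d = [inv[j][i] for j in range(n)]            # ith column of the inverse
            # Terms (be - al*tau)^+ of the one-dimensional problem along d.
            terms = [(dot(A[r], d), b[r] - dot(A[r], x), r) for r in range(len(A))
                     if r not in basis or r == basis[i]]
            active = sum(-al for al, be, _ in terms if be > 0)
            gamma = active + sum(-al for al, be, _ in terms if be == 0 and al < 0)
            delta = active - sum(al for al, be, _ in terms if be == 0 and al > 0)
            rate = min(gamma, -delta)
            if best is None or rate < best[0]:
                best = (rate, i, terms, gamma)
        rate, i, terms, gamma = best
        if rate >= 0:
            return x, basis                              # Step 3
        sign = 1 if gamma < 0 else -1                    # Step 2: descent direction
        pts = sorted((sign * be / al, abs(al), r) for al, be, r in terms
                     if al != 0 and sign * be / al > 0)
        slope = rate
        for _, step, r in pts:
            slope += step
            if slope >= 0:
                basis[i] = r                             # row r enters the basis
                break

# Hypothetical infeasible system: x1 >= 0, x2 >= 0, x1 >= 2, x2 >= 3, x1 + x2 <= 4.
A = [[1, 0], [0, 1], [1, 0], [0, 1], [-1, -1]]
b = [0, 0, 2, 3, -4]
x_opt, final_basis = alpd(A, b, [0, 1])
```

Starting from the basic solution (0, 0) with objective value 5, this sketch makes two basis changes and stops at a point with objective value 1, which is the minimum for this system.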

PROPOSITION (3.3.2). Under the non-degeneracy assumption, the ALPD
algorithm converges in a finite number of steps to an optimal basic
solution.

Proof. Let $x_k$ be an intermediate basic solution. Since $x_k$ is not the
final solution, by Step 1 and convexity of f(x) the right and left-hand
derivatives $\gamma_{i^*}$ and $\delta_{i^*}$ are non-zero and have the same algebraic sign.
Hence $\tau = 0$ cannot be optimal in Step 2 and therefore by Proposition
(3.2.15) $x_k$ and $x_{k+1}$ are adjacent basic solutions with $f(x_{k+1}) < f(x_k)$.
Thus cycling cannot occur. There are at most $\binom{m}{n}$ basic solutions, so
the algorithm must be finite. It remains to be shown that the final basic
solution $\hat{x}$ is optimal. This will be done by demonstrating that the
corresponding point

$(x^+, x^-, s^+, s^-) = (\hat{x}^+, \hat{x}^-, (b - A\hat{x})^+, (b - A\hat{x})^-)$

is optimal for the LPD linear program (3.1.4).

Let $A_1$ be the n x n submatrix of A corresponding to the basic
tight inequalities at $x = \hat{x}$. Then the system $Ax \ge b$ can be written as

$A_1 x \ge b_1$
$A_2 x \ge b_2$

where $A_2$ is an (m-n) x n matrix corresponding to the remaining non-basic
inequalities. The linear program (3.1.4) then takes the form

min  $e \cdot s_1^+ + e \cdot s_2^+$
s.t.  $A_1 x^+ - A_1 x^- + I s_1^+ - I s_1^- = b_1$
      $A_2 x^+ - A_2 x^- + I s_2^+ - I s_2^- = b_2$
      $x^+, x^-, s_1^+, s_1^-, s_2^+, s_2^- \ge 0$
      $x^+, x^- \in \mathbb{R}^n$, $s_1^+, s_1^- \in \mathbb{R}^n$, $s_2^+, s_2^- \in \mathbb{R}^{m-n}$

Since $A_1$ is non-singular, this problem can be transformed by elementary
row operations (Gaussian elimination) to yield

(3.3.3)   min  $e \cdot s_1^+ + e \cdot s_2^+$
          s.t.  $I x^+ - I x^- + A_1^{-1} s_1^+ - A_1^{-1} s_1^- = A_1^{-1} b_1$
                $I s_2^+ - I s_2^- - A_2 A_1^{-1} s_1^+ + A_2 A_1^{-1} s_1^- = b_2 - A_2 A_1^{-1} b_1$
                $x^+, x^-, s_1^+, s_1^-, s_2^+, s_2^- \ge 0$
                $x^+, x^- \in \mathbb{R}^n$, $s_1^+, s_1^- \in \mathbb{R}^n$, $s_2^+, s_2^- \in \mathbb{R}^{m-n}$

The following basic feasible solution to (3.3.3) can be selected
from $(x^+, x^-, s^+, s^-)$:

(3.3.4)   For i = 1,...,n:
            $(x^+)_i$ is basic if $(A_1^{-1} b_1)_i \ge 0$
            $(x^-)_i$ is basic if $(A_1^{-1} b_1)_i < 0$
          For j = 1,...,m-n:
            $(s_2^+)_j$ is basic if $(b_2 - A_2 A_1^{-1} b_1)_j > 0$
            $(s_2^-)_j$ is basic if $(b_2 - A_2 A_1^{-1} b_1)_j < 0$

This solution corresponds to the basic solution $\hat{x} = A_1^{-1} b_1$ to (3.1.1).
(3.3.4) is optimal if the reduced costs for all non-basic variables are
non-negative.

Let $(y_1, y_2)$, where $y_1 \in \mathbb{R}^n$, $y_2 \in \mathbb{R}^{m-n}$, be the simplex
multiplier vector associated with the basic feasible solution (3.3.4).
From the form of (3.3.3) it follows that $y_1 = 0$ and hence the reduced
cost for all non-basic components of $x^+$ and $x^-$ equals zero. Also
from (3.3.3) it is seen that

(3.3.5)   $(y_2)_j = 1$ if $(s_2^+)_j$ is basic
          $(y_2)_j = 0$ if $(s_2^-)_j$ is basic

Thus the reduced cost for all the non-basic components of $s_2^+$ and $s_2^-$
equals +1. The reduced cost for $(s_1^+)_i$ is

(3.3.6)   $1 + y_2 A_2 d^i$ ,  i = 1,...,n

where $d^i$ is the ith column of $A_1^{-1}$. Similarly, the reduced cost for
$(s_1^-)_i$ is

(3.3.7)   $-y_2 A_2 d^i$ ,  i = 1,...,n.

Application  of  the  formulas  (3.2.8)  to  the  function 

(5*5.8)  f(x  + xd1)  = e-(b  - aT)+  + (-T)+  , 

where  a = Agd  and  b = h^  - A2A^H)^,  y^el^s 


and 


df  (x  + rd1) 
dT" 


m-n 


x=0 


e (a/j: 
j=i  0 


-y2A2d 


ri 


df (x  + Td1) 
dr 


T=0 


m-n 

- 1 - E (Ad1) 

d=i  2 J 


= - 1 ‘ y2A2d 


" 5i 


Comparison  with  (3.3.6)  and  (3.3.7)  reveals  that  ^ and  -B  are  the 
reduced  costs  for  (s^)^  a-nd  (s*)^,  resPectively. 

But  by  the  termination  condition  in  Step  1, 


r.  > 0 , 


-8.  > 0 , 

l — ’ 


i = 1, . . .,n. 


at  x,  so  x is  optimal. 
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The  ALPD  algorithm  can  be  implemented  using  a pivotal  procedure 


on  tableaus.  Define  the  tableau 


corresponding  to  the  basic  solution  x = A^,  By  elementary  column 
operations  (equivalent  to  Gaussian  elimination  row  operations  on  the 
transpose  of  T),  T can  be  transformed  into  the  canonical  form 


I 

0 

(3.3.9) 

»-3 

o 

II 

. VI1 

b2  - ^ 

Let $a^i$ be the ith column of $A_2 A_1^{-1}$, $i = 1, \ldots, n$, and let
$b = b_2 - A_2 A_1^{-1} b_1$. Then the right and left-hand derivatives $\gamma_i$,
$\delta_i$ of the function $f_i(\tau) = e \cdot (b - a^i \tau)^+ + (-\tau)^+$ at $\tau = 0$
can be calculated from (3.2.8). The fastest rate of descent (minimum
reduced cost for the primal problem) is

(3.3.10)  $\Delta_{i^*} = \min_{1 \le i \le n} \{\min(\gamma_i, -\delta_i)\}$

If $\Delta_{i^*} \ge 0$, then the current solution is optimal. Otherwise the i*th
row of $A_1$ will leave the dual basis. It will be replaced by the j*th
row of $A_2$, where

(3.3.11)  $(b)_{j^*} / (a^{i^*})_{j^*}$

is the minimizing breakpoint of $f_{i^*}(\tau)$.

This is accomplished by executing a standard simplex method pivot
on the transpose of $T_c$, using the (i*, j* + n) element of $T_c'$ as
the pivot element. The new tableau, after rearrangement, will be in the
canonical form corresponding to the new dual basis.
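The selection rule (3.3.10)-(3.3.11) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the FORTRAN implementation mentioned below: the function names are hypothetical, the columns $a^i$ and the vector b of the canonical tableau are passed as plain lists, and the one-sided derivatives are computed directly from the definition of $f_i$ rather than from (3.2.8). The minimizing breakpoint is found by evaluating $f_{i^*}$ at every breakpoint, which is adequate only for small examples.

```python
def pos(v):
    """Positive part function."""
    return v if v > 0 else 0.0


def f_line(a, b, tau):
    """f_i(tau) = e.(b - a*tau)+ + (-tau)+ for one tableau column a."""
    return sum(pos(bj - aj * tau) for aj, bj in zip(a, b)) + pos(-tau)


def one_sided_derivatives(a, b):
    """Right (gamma) and left (delta) derivatives of f_i at tau = 0."""
    active = sum(aj for aj, bj in zip(a, b) if bj > 0)
    # Rows with b_j = 0 (degenerate case) have a kink exactly at tau = 0.
    gamma = -active + sum(pos(-aj) for aj, bj in zip(a, b) if bj == 0)
    delta = -1.0 - active - sum(pos(aj) for aj, bj in zip(a, b) if bj == 0)
    return gamma, delta


def select_pivot(columns, b):
    """Return (i_star, j_star, rate, tau_star), or None at an optimum."""
    rates = []
    for i, a in enumerate(columns):
        gamma, delta = one_sided_derivatives(a, b)
        rates.append((min(gamma, -delta), i))
    rate, i_star = min(rates)
    if rate >= 0:
        return None
    a = columns[i_star]
    # f_i is convex and piecewise linear, so its minimum along the line
    # is attained at a breakpoint tau = b_j / a_j.
    breakpoints = [(f_line(a, b, bj / aj), j, bj / aj)
                   for j, (aj, bj) in enumerate(zip(a, b))
                   if aj != 0 and bj != 0]
    _, j_star, tau_star = min(breakpoints)
    return i_star, j_star, rate, tau_star
```

For the single column a = (1, -2) with b = (2, -1), `select_pivot([[1.0, -2.0]], [2.0, -1.0])` reports a descent rate of -1 and selects the second non-basic row, whose breakpoint lies at $\tau = 0.5$.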

The  pivot  operation  used  to  move  from  one  basic  solution  to  an 
adjacent  basic  solution  is  thus  the  same  as  that  used  by  the  simplex  method 
with  upper  bounds  applied  to  the  dual  (3.1.5)  in  exchanging  one  basic 
column  of  A'  for  another.  Thus  the  ALPD  algorithm  can  be  implemented 
simply  by  changing  the  pivot  selection  rule  in  standard  simplex  method 
software  and  ignoring  the  upper  bounds  on  the  dual  variables.  The  new 
pivot  rule  selects  the  column  that  leaves  the  dual  basis  according  to 
the  minimum  reduced  cost  rule  (3.3.10).  The  entering  column  is  selected 
as  the  one  corresponding  to  the  breakpoint  that  minimizes  the  LPD 
objective  function  (3.3.9).  Each  iteration  then  results  in  a decreased 
primal  objective  rather  than  increased  dual  objective. 

The  relative  efficiencies  of  the  ALPD  and  simplex  method  pivot 
selection  rules  can  be  compared  directly  by  counting  the  number  of  basis 
changes  (pivot  operations)  required  to  reach  optimality  from  a given 
starting basis. The ALPD algorithm has been coded in FORTRAN and applied
to  numerous  small  (typically  m < 1000,  n < 11)  LPD  problems  arising 
from  pattern  classification  models.  These  problems  generally  have  a 
totally  dense  A with  b > 0.  Comparative  runs  with  the  simplex  method 
with  upper  bounds  have  been  made  with  the  following  general  results. 

In  cases  where  the  inequality  system  Ax  > b is  feasible,  the  two  pivot 
rules  require  approximately  the  same  number  of  pivots.  However,  in  cases 
where  the  system  is  infeasible,  the  number  of  simplex  method  pivots 
grows rapidly with the extent of the infeasibility, i.e. the number of
inequalities violated by the optimal solution. The number of ALPD
pivots  appears  insensitive  to  this  factor.  In  many  infeasible  cases 
the  number  of  simplex  pivots  exceeded  the  number  of  ALPD  pivots  by 
factors  of  several  hundred.  Detailed  results  of  a series  of  systematic 
comparison  trials  are  presented  in  the  next  chapter. 


3.4.  Initializing the Algorithm

The choice of the initial basic solution is arbitrary. However,
the following procedure produces an initial basic solution by constructing
a sequence $(x_0, x_1, \ldots, x_n)$ of points such that

$f(x_k) \le f(x_{k-1})$,  $k = 1, \ldots, n$

The final point $x_n$ is the desired initial basic solution. Thus a
considerable amount of improvement in the objective function may be
achieved in the initialization sequence.

Let

k = iteration number

$\mathcal{B}_k$ = partial set of vectors in the dual basis
at iteration k.

PROCEDURE (3.4.1).  ALPD Initialization.

Step 1.  Set $\mathcal{B}_0 = \emptyset$, $x_0 = 0$.
Choose an arbitrary direction $d_0 \ne 0$ (the unit vector
$d_0 = (1, 0, \ldots, 0)$ is convenient); set k = 1 and go to Step 2.

Step 2.  Solve the one-dimensional LPD problem

$\min_{\tau} f(x_{k-1} + \tau d_{k-1})$

Let $a^k$ be the row vector corresponding to the inequality that
becomes tight at the optimizing $\tau = \tau^*$. Set

$\mathcal{B}_k = \mathcal{B}_{k-1} \cup \{a^k\}$,  $x_k = x_{k-1} + \tau^* d_{k-1}$

If k = n, go to Step 4. Otherwise go to Step 3.

Step 3.  Determine a new direction $d_k \ne 0$ such that

$d_k \cdot a^i = 0$,  $i = 1, \ldots, k$

(The Gram-Schmidt orthogonalization procedure can be used.)
Increment k by one and go to Step 2.

Step 4.  Stop. $\mathcal{B}_n$ is the initial dual basis and $x_n$ is the initial
basic solution.


After Step 2 of iteration k, the k inequalities corresponding to
$a^1, \ldots, a^k$ are tight at $x_k$. Each new search direction is generated
in such a way that these inequalities remain tight at $x_{k+1}$, where a new
inequality becomes tight.
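The initialization procedure can be sketched as follows. This is a minimal illustration under several assumptions that are not part of the procedure as stated: the function names are hypothetical, A is assumed to have full column rank, the one-dimensional problem in Step 2 is solved by evaluating f at every breakpoint along the search line, and ties are broken by row order.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def lpd_objective(A, b, x):
    """f(x) = e.(b - Ax)+, the sum of the violated amounts."""
    return sum(max(bi - dot(ai, x), 0.0) for ai, bi in zip(A, b))


def line_minimize(A, b, x, d, tol=1e-12):
    """Step 2: minimize f(x + tau*d); return (tau_star, newly tight row)."""
    best = None
    for j, (aj, bj) in enumerate(zip(A, b)):
        slope = dot(aj, d)
        if abs(slope) <= tol:          # row j never becomes tight along d
            continue
        tau = (bj - dot(aj, x)) / slope
        value = lpd_objective(A, b, [xi + tau * di for xi, di in zip(x, d)])
        if best is None or value < best[0]:
            best = (value, tau, j)
    return best[1], best[2]


def orthogonal_direction(rows, n, tol=1e-9):
    """Step 3 (Gram-Schmidt): a nonzero vector orthogonal to all rows."""
    basis = []
    units = [[1.0 if i == k else 0.0 for i in range(n)] for k in range(n)]
    for idx, v in enumerate([list(r) for r in rows] + units):
        for q in basis:
            c = dot(v, q)
            v = [vi - c * qi for vi, qi in zip(v, q)]
        norm = math.sqrt(dot(v, v))
        if norm <= tol:
            continue
        if idx >= len(rows):           # remainder of a unit vector
            return v
        basis.append([vi / norm for vi in v])
    raise ValueError("the rows already span the whole space")


def alpd_initialize(A, b):
    """Steps 1-4: return (tight row indices, x_n, objective values)."""
    n = len(A[0])
    x = [0.0] * n
    d = [1.0] + [0.0] * (n - 1)
    tight, trace = [], [lpd_objective(A, b, x)]
    for k in range(1, n + 1):
        tau, j = line_minimize(A, b, x, d)
        x = [xi + tau * di for xi, di in zip(x, d)]
        tight.append(j)
        trace.append(lpd_objective(A, b, x))
        if k < n:
            d = orthogonal_direction([A[i] for i in tight], n)
    return tight, x, trace


A = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [2.0, 3.0, 2.0]
tight, x_n, trace = alpd_initialize(A, b)
```

In this small example the objective falls from 7 at $x_0 = 0$ to 3 and then to 2, and the rows returned in `tight` are the two inequalities that hold with equality at the final point.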




3.5.  Extensions  of  the  LPD  Problem 

In this section a sequence of increasingly general linear programs
is shown to be reducible to equivalent LPD linear programs. Ultimately
the applicability of the ALPD algorithm to the general linear programming
problem is demonstrated.

The weighted LPD problem with tableau [A : b] and weight vector
$w > 0$ is defined as the linear program

         min   $w \cdot s$

(3.5.1)  s.t.  $Ax + Is \ge b$

               $s \ge 0$

               $x \in \mathbb{R}^n$,  $s \in \mathbb{R}^m$

This is the immediate generalization of the standard LPD problem (3.1.2)
obtained by replacing the LPD objective $e \cdot s$ with the weighted LPD
objective $w \cdot s$. The equivalent unconstrained problem is

(3.5.2)  $\min_{x \in \mathbb{R}^n} f(x) = w \cdot (b - Ax)^+$

Since $w > 0$, $w \cdot (b - Ax)^+ = e \cdot (Wb - WAx)^+$, where W is the m x m
diagonal matrix defined by $(W)_{ii} = (w)_i$, $i = 1, \ldots, m$. Thus (3.5.1) is
equivalent to a standard LPD problem with tableau [WA : Wb]. In
application of the ALPD algorithm to (3.5.1), either of the tableaus [A : b]
or [WA : Wb] may be used for pivoting since both have the same set of
basic solutions. However, the pivot selections are governed
by the derivatives of the LPD objective function $f(x) = e \cdot (Wb - WAx)^+$.
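The row-scaling identity used above is easy to check numerically. The sketch below is illustrative only; the helper names and the sample data are assumptions introduced here.

```python
def pos(v):
    """Positive part function."""
    return v if v > 0 else 0.0


def weighted_lpd(A, b, w, x):
    """w.(b - Ax)+ for the tableau [A : b] and weight vector w."""
    return sum(
        wi * pos(bi - sum(aij * xj for aij, xj in zip(ai, x)))
        for ai, bi, wi in zip(A, b, w)
    )


def scale_tableau(A, b, w):
    """Rows of the scaled tableau [WA : Wb] with W = diag(w)."""
    WA = [[wi * aij for aij in ai] for ai, wi in zip(A, w)]
    Wb = [wi * bi for bi, wi in zip(b, w)]
    return WA, Wb


A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [1.0, 2.0, 4.0]
w = [2.0, 0.5, 3.0]
x = [0.5, 1.0]

WA, Wb = scale_tableau(A, b, w)
weighted_value = weighted_lpd(A, b, w, x)                 # w.(b - Ax)+
standard_value = weighted_lpd(WA, Wb, [1.0] * len(b), x)  # e.(Wb - WAx)+
```

Both values agree because multiplying a row by a positive weight scales its positive part by the same weight. The argument fails for a negative weight, which is why $w > 0$ is required.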

The weighted LPD problem can be further generalized by allowing
penalties on both positive and negative deviations. Let $w \in \mathbb{R}^m$,
$z \in \mathbb{R}^m$ be non-negative vectors such that $w + z > 0$. The weighted
least deviations problem with tableau [A : b] and weight vectors w and
z is defined by the linear program

         min   $w \cdot s_1 + z \cdot s_2$

(3.5.3)  s.t.  $Ax + Is_1 - Is_2 = b$

               $s_1 \ge 0$,  $s_2 \ge 0$

               $x \in \mathbb{R}^n$,  $s_1 \in \mathbb{R}^m$,  $s_2 \in \mathbb{R}^m$

It is easily seen that in an optimal solution $(\hat{x}, \hat{s}_1, \hat{s}_2)$ to (3.5.3)
the relations $\hat{s}_1 = (b - A\hat{x})^+$ and $\hat{s}_2 = (b - A\hat{x})^-$ must hold and hence
(3.5.3) is equivalent to the unconstrained problem


(3.5.4)  $\min_{x \in \mathbb{R}^n} f(x) = w \cdot (b - Ax)^+ + z \cdot (b - Ax)^-$

Since $(b - Ax)^- = (-b + Ax)^+$, the weighted least deviations problem
(3.5.3) can be reformulated as a weighted LPD problem. In particular,
if $w > 0$ and $z > 0$ the tableau and weight vector for this weighted
LPD problem are

(3.5.5)  $\begin{bmatrix} A & b \\ -A & -b \end{bmatrix}$  and  $[w, z]$

respectively. In application of the ALPD algorithm to (3.5.3) it is
sufficient to pivot on the partial tableau [A : b] since the pivoting
operation preserves the opposite sign relationship between the upper
and lower halves of the full tableau. Again, however, the pivot selections
are determined by the derivatives of (3.5.4).
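The reformulation of the weighted least deviations objective as a weighted LPD objective on a stacked tableau can be illustrated as follows. The function names and sample data are assumptions made for this sketch. It stacks the rows [A : b] over [-A : -b] and concatenates the weight vectors w and z.

```python
def pos(v):
    """Positive part function."""
    return v if v > 0 else 0.0


def neg(v):
    """Negative part function: -v if v < 0, else 0."""
    return -v if v < 0 else 0.0


def residuals(A, b, x):
    return [bi - sum(aij * xj for aij, xj in zip(ai, x)) for ai, bi in zip(A, b)]


def least_deviations(A, b, w, z, x):
    """w.(b - Ax)+ + z.(b - Ax)-, the weighted least deviations objective."""
    return sum(wi * pos(r) + zi * neg(r)
               for wi, zi, r in zip(w, z, residuals(A, b, x)))


def stacked_problem(A, b, w, z):
    """Stacked tableau [A : b] over [-A : -b] with weight vector [w, z]."""
    A_full = [list(ai) for ai in A] + [[-aij for aij in ai] for ai in A]
    b_full = list(b) + [-bi for bi in b]
    return A_full, b_full, list(w) + list(z)


def weighted_lpd(A, b, w, x):
    return sum(wi * pos(r) for wi, r in zip(w, residuals(A, b, x)))


A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [1.0, 2.0, 2.0]
w = [1.0, 2.0, 1.0]
z = [3.0, 1.0, 2.0]
x = [2.0, 1.0]

direct = least_deviations(A, b, w, z, x)
A_full, b_full, weights = stacked_problem(A, b, w, z)
via_lpd = weighted_lpd(A_full, b_full, weights, x)
```

At the sample point the residuals are (-1, 1, -1), so both evaluations give 3 + 2 + 2 = 7.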

The dual of (3.5.3) is

         max   $b \cdot y$

(3.5.6)  s.t.  $A'y = 0$

               $-z \le y \le w$

               $y \in \mathbb{R}^m$

which differs from the dual (3.1.5) of the standard LPD problem only in
the generalization of the lower and upper bounds on the dual variables.


Example (3.5.6).

The general linear approximation problem with $\ell_p$ norm criterion
is

(3.5.7)  $\min_{x \in \mathbb{R}^n} f(x) = \|Ax - b\|_p$

The vector $b \in \mathbb{R}^m$ is approximated by a linear combination of the columns
of the m x n matrix A, with the best approximation defined as that which
minimizes the $\ell_p$ norm of the residual vector. Problems of this type
arise in linear regression analysis and function approximation ('curve
fitting'). The choice p = 2, equivalent to the usual least squares
criterion in regression analysis, is the simplest case both analytically
and computationally, since an explicit solution $x = (A'A)^{-1} A'b$ exists
whenever A'A is non-singular. Also, in the general linear statistical
model with the usual Gaussian error distribution assumption, the maximum
likelihood estimate of the coefficient vector is a solution to a problem
of type (3.5.7) with $\ell_2$ norm. However, it was first suggested by
Edgeworth [12] that the $\ell_1$ criterion of minimizing the sum of the
absolute values rather than the sum of the squares of the deviations
may be more suitable when the deviations are large and erratic. For
example, if the error distribution is given by the double exponential
distribution with probability density

$f(\epsilon) = (2\sigma)^{-1} e^{-|\epsilon|/\sigma}$,  $-\infty < \epsilon < \infty$

then the maximum likelihood estimate is a solution to a linear approximation
problem with $\ell_1$ norm criterion (Draper and Smith [13]). This
distribution has a much more slowly decaying tail than a normal distribution
with the same variance $\sigma^2$ and hence is more likely to produce the
kind of deviation pattern mentioned above. Before the availability of
linear programming techniques, however, the computational difficulties
presented by the $\ell_1$ criterion limited application of early solution
methods (e.g. Rhodes [14] and Singleton [15]) to low dimensional problems,
typically n < 5. The first linear programming formulation of this
problem is due to Charnes and Cooper [16], who present the least weighted
deviations program (3.5.3) with w = e and z = e. The computational
advantage of the dual and the applicability of the simplex method with
upper bounds is noted by Wagner [17]. The ALPD algorithm given here
is a generalization of a special purpose algorithm for the $\ell_1$ norm
problem presented without proof by Davies [18].  □
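The link between the double exponential error model and the $\ell_1$ criterion can be made explicit. The short derivation below is a sketch under two assumptions not stated above: the m errors $\epsilon_i = (b)_i - a_i x$ are independent, where $a_i$ denotes the ith row of A, and the density has the form $(2\sigma)^{-1} e^{-|\epsilon|/\sigma}$.

```latex
\begin{align*}
L(x, \sigma) &= \prod_{i=1}^{m} (2\sigma)^{-1}
                \exp\bigl( -\lvert (b)_i - a_i x \rvert / \sigma \bigr), \\
\log L(x, \sigma) &= -m \log(2\sigma)
                - \frac{1}{\sigma} \sum_{i=1}^{m} \lvert (b)_i - a_i x \rvert
              = -m \log(2\sigma) - \frac{1}{\sigma} \, \lVert Ax - b \rVert_1 .
\end{align*}
```

For any fixed $\sigma > 0$, maximizing $\log L$ over x is therefore the same as minimizing $\|Ax - b\|_1$, which is problem (3.5.7) with p = 1.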

The final extension of the LPD problem considered here is the
constrained weighted LPD problem obtained by adding the p x n inequality
system

(3.5.8)  $A_2 x \ge b_2$,  $x \in \mathbb{R}^n$

to the constraint set of the weighted LPD problem (3.5.1). (A constrained
weighted least deviations problem can similarly be defined by adding
(3.5.8) to the constraint set of the weighted least deviations problem
(3.5.3). As shown above, the weighted least deviations problem can be
reformulated as a weighted LPD problem, so the discussion below also
applies to this case.) The general form of the constrained problem is
thus

         min   $w \cdot s$

(3.5.9)  s.t.  $A_1 x + Is \ge b_1$

               $A_2 x \ge b_2$

               $s \ge 0$

               $x \in \mathbb{R}^n$,  $s \in \mathbb{R}^m$

The (m + p) x n matrix

$\begin{bmatrix} A_1 \\ A_2 \end{bmatrix}$,

where m + p > n, is assumed to be of full column rank n.

Problem (3.5.9) does not have a direct weighted LPD equivalent.
However, if the inequality system (3.5.8) is feasible, the weighted
LPD problem

          min   $w \cdot s_1 + \lambda (e \cdot s_2)$

(3.5.10)  s.t.  $A_1 x + Is_1 \ge b_1$

                $A_2 x + Is_2 \ge b_2$

                $s_1 \ge 0$,  $s_2 \ge 0$

                $x \in \mathbb{R}^n$,  $s_1 \in \mathbb{R}^m$,  $s_2 \in \mathbb{R}^p$

will be shown to have the same set of optimal basic solutions as (3.5.9)
for sufficiently large values of the scalar weighting factor $\lambda$.

PROPOSITION (3.5.11).  If the added constraints (3.5.8) are feasible,
then the constrained weighted LPD problem (3.5.9) has an optimal solution.

Proof.  If $\hat{x}$ is feasible for (3.5.8), then $x = \hat{x}$, $s = (b_1 - A_1 \hat{x})^+$ is
feasible for (3.5.9). The objective function of (3.5.9) is bounded
below by zero, so an optimal solution exists.  □


LEMMA (3.5.12).  Let x* be an optimal basic solution to (3.5.10)
for some $\lambda > 0$ and let $s_1^* = (b_1 - A_1 x^*)^+$. If $A_2 x^* \ge b_2$, then
$(x^*, s_1^*)$ is optimal for (3.5.9).

Proof.  If $A_2 x^* \ge b_2$, then there exists an optimal solution $(\hat{x}, \hat{s})$
to (3.5.9) by Proposition (3.5.11). Since $(x^*, s_1^*)$ is feasible for
(3.5.9), $w \cdot s_1^* \ge w \cdot \hat{s}$. But $(x, s_1, s_2) = (\hat{x}, \hat{s}, 0)$ is feasible for (3.5.10)
whereas $(x^*, s_1^*, 0)$ is optimal, so $w \cdot s_1^* \le w \cdot \hat{s}$. Thus $w \cdot s_1^* = w \cdot \hat{s}$
and $(x^*, s_1^*)$ is optimal for (3.5.9).  □

PROPOSITION (3.5.13).  If the added constraints (3.5.8) are feasible,
then there exists a number $\bar{\lambda} \ge 0$ such that any optimal basic solution
x* to (3.5.10) for $\lambda > \bar{\lambda}$ defines an optimal solution $x = x^*$,
$s = (b_1 - A_1 x^*)^+$ to (3.5.9).

Proof.  Let $\mathcal{S} = \{x_1, \ldots, x_k\}$ be the set of basic solutions to
(3.5.10) that are infeasible for the system $A_2 x \ge b_2$. $\mathcal{S}$ is a finite,
possibly empty set. If $\mathcal{S}$ is empty, let $\bar{\lambda} = 0$ and the result follows
from the lemma. If $\mathcal{S}$ is non-empty, define

$\delta = \min_{x_i \in \mathcal{S}} e \cdot [b_2 - A_2 x_i]^+$

and let $\bar{\lambda} = (w \cdot \hat{s})/\delta$ where $(\hat{x}, \hat{s})$ is any optimal solution to (3.5.9).
Thus if $\lambda > \bar{\lambda}$,

$\lambda (e \cdot [b_2 - A_2 x_i]^+) > w \cdot \hat{s}$  for all $x_i \in \mathcal{S}$

Hence no member of $\mathcal{S}$ can be optimal for (3.5.10) when $\lambda > \bar{\lambda}$ since
$(x, s_1, s_2) = (\hat{x}, \hat{s}, 0)$ is a feasible solution with a lower objective value.
The result then follows from the lemma.  □

Thus the constrained LPD problem can be solved by application
of the ALPD algorithm to the weighted LPD problem (3.5.10) with any
weight factor $\lambda$ greater than $\bar{\lambda}$. In general the value of $\bar{\lambda}$ is
not explicitly known, so the choice of $\lambda$ is open at the start of the
algorithm. Numerical experience with constrained LPD problems has
shown that if $\lambda$ is initially chosen very large, the sequence of basic
solutions encountered by the ALPD algorithm first is driven into the
feasible solution set of (3.5.9) with all subsequent basic solutions
remaining in this set. For sufficiently large values of $\lambda$, this
behavior is guaranteed. It follows from the argument used to prove
Proposition (3.5.13) that the objective values corresponding to basic
solutions that are infeasible for (3.5.9) are bounded below by $\delta \cdot \lambda$,
where $\delta > 0$, while objective values for feasible basic solutions do
not depend on $\lambda$. Thus for sufficiently large values of $\lambda$, the latter
will be uniformly lower than the former. In practice, once a feasible
basic solution is attained, the value of $\lambda$ can be raised at any time
during the course of the algorithm to avoid a pivot operation that would
result in an infeasible basic solution.
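The role of the threshold on the weight factor can be seen in a one-variable example. The sketch below is an illustration under assumed data, and its function names are hypothetical. There is a single penalized inequality x >= 3 with unit weight and a single added constraint x <= 1, written as -x >= -1. In one variable the basic solutions are simply the points at which one inequality is tight.

```python
def pos(v):
    """Positive part function."""
    return v if v > 0 else 0.0


soft = [(1.0, 3.0)]     # rows (a, b) of  A1 x + s1 >= b1 :  x >= 3
hard = [(-1.0, -1.0)]   # rows (a, b) of  A2 x >= b2      : -x >= -1
w = [1.0]


def soft_cost(x):
    """w.(b1 - A1 x)+"""
    return sum(wi * pos(bi - ai * x) for (ai, bi), wi in zip(soft, w))


def hard_violation(x):
    """e.(b2 - A2 x)+"""
    return sum(pos(bi - ai * x) for ai, bi in hard)


def penalized(x, lam):
    """Objective of the penalized problem for weight factor lam."""
    return soft_cost(x) + lam * hard_violation(x)


basic = sorted({bi / ai for ai, bi in soft + hard if ai != 0})
infeasible = [x for x in basic if hard_violation(x) > 0]
feasible = [x for x in basic if hard_violation(x) == 0]

delta = min(hard_violation(x) for x in infeasible)
# In this example the constrained optimum is attained at a feasible
# basic solution, so its value can be read off directly.
lam_bar = min(soft_cost(x) for x in feasible) / delta


def best_basic_solution(lam):
    return min(basic, key=lambda x: penalized(x, lam))
```

Here the threshold evaluates to 1. With a weight factor of 0.5 the penalized problem prefers the infeasible point x = 3, while with a weight factor of 2 it selects x = 1, the optimum of the constrained problem.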

The following example demonstrates the applicability of the
ALPD algorithm to the general linear programming problem.

Example (3.5.14).

The general linear programming problem is

          min   $c \cdot x$

(3.5.15)  s.t.  $Ax = b$

                $x \ge 0$,  $x \in \mathbb{R}^n$

where A is an m x n matrix assumed to be of full row rank m with
n > m. The dual of (3.5.15) is

          max   $b \cdot y$

(3.5.16)  s.t.  $A'y \le c$

                $y \in \mathbb{R}^m$

If an optimal solution $\hat{x}$ to (3.5.15) exists, then by duality theory
an optimal solution $\hat{y}$ to (3.5.16) exists and $c \cdot \hat{x} = b \cdot \hat{y}$. In this case
let $\alpha$ be any number such that $\alpha > b \cdot \hat{y}$. Then the constrained LPD
problem

          min   $\sigma$

(3.5.17)  s.t.  $b \cdot y + \sigma \ge \alpha$

                $-A'y \ge -c$

                $\sigma \ge 0$

                $y \in \mathbb{R}^m$,  $\sigma \in \mathbb{R}^1$

is feasible and has an optimal solution $(y^*, \sigma^*)$ by Proposition
(3.5.11). It is easily seen that $\sigma^* = \alpha - b \cdot \hat{y}$ and y* is an optimal
solution to the dual problem (3.5.16). A value of $\alpha$ need not be known
explicitly for application of the ALPD algorithm. It is sufficient for
purposes of calculating the required derivatives simply to consider
the inequality $b \cdot y \ge \alpha$ as always being violated. The optimal
solution to the primal problem (3.5.15) conveniently appears in the
final tableau in the row corresponding to this inequality.

Numerical experience reported in the next chapter suggests that
if the underlying inequality system is feasible or nearly feasible, the
ALPD algorithm is competitive with standard simplex software in terms
of the total number of basis changes but does not offer any significant
computational advantages. For example, once a feasible basic solution
to the inequality system $-A'y \ge -c$ is reached in Example (3.5.14),
close examination of the ALPD algorithm reveals that for large values
of the weight factor $\lambda$ the pivot sequence is precisely the same as
that of the dual simplex method (Dantzig [11]) applied to the primal.
Thus the algorithm simply becomes a convenient method of initializing
the dual simplex method if a basis with non-negative reduced costs is
not readily available. However, the ALPD algorithm has shown a large
computational advantage if there are a large number of infeasibilities
in the inequality system. This case arises, for example, in the
linear approximation problem (3.5.6) since both the systems $Ax \ge b$
and $-Ax \ge -b$ appear in the LPD formulation. Similarly, the algorithm
should perform well on the following constrained version of this
problem.

Example (3.5.18).

Let $b_k \in \mathbb{R}^m$ be the state vector at time k for a discrete
time control system governed by the equation

$b_{k+1} = F b_k + \alpha_k g$

where F is an m x m matrix, $\alpha_k$ is the scalar control applied at time k,
and $g \in \mathbb{R}^m$ is a constant vector representing the change in the state
vector per unit of applied control. Given $b_0$, the terminal error
problem [19] requires the determination of a sequence of controls
$x = (\alpha_0, \alpha_1, \ldots, \alpha_{n-1})$ that minimizes the $\ell_1$ norm of the difference
between the terminal state vector $b_n$ and a desired state vector $\bar{b}$.
The control sequence vector $x \in \mathbb{R}^n$ is subject to the inequality
constraints

$A_2 x \le c$

where $A_2$ is a p x n matrix and $c \in \mathbb{R}^p$.

The terminal state $b_n$ is given by

$b_n = F^n b_0 + F^{n-1} g \alpha_0 + F^{n-2} g \alpha_1 + \cdots + F g \alpha_{n-2} + g \alpha_{n-1}$

where $b_0$ is the initial state vector. Define the m x n matrix $A_1$ as

$A_1 = (F^{n-1} g, F^{n-2} g, \ldots, F g, g)$

Then the terminal error problem can be formulated as the constrained
least total deviations linear program

min   $e \cdot s^+ + e \cdot s^-$

s.t.  $A_1 x + s^+ - s^- = \bar{b} - F^n b_0$

      $-A_2 x \ge -c$

      $s^+ \ge 0$,  $s^- \ge 0$

      $x \in \mathbb{R}^n$,  $s^+ \in \mathbb{R}^m$,  $s^- \in \mathbb{R}^m$.  □


CHAPTER 4


THE LINEARLY INSEPARABLE CASE

4.1.  The Stochastic Classification Problem

In many practical applications, the patterns in a given class
can be regarded as random vectors distributed according to some
multivariate probability distribution. For example, in the template-matching
problem defined in Section (1.2), each observed pattern in a given
class is equal to the sum of one of a finite number of prototype patterns
from that class and a random displacement vector attributable to random
observation error or random variability in the pattern population
itself. It was shown in Chapter 2 that if the underlying prototype
sets are linearly separable and there exists a sufficiently small bound
on the size of the random displacement vectors as measured by the $\ell_p$
norm, then any sets of observed patterns from the two classes are also
linearly separable. If the prototype sets are completely known, then
the maximum quality programs (2.3.2) and (2.3.4) with $\ell_p$ norm
determine the linear discriminant that maximizes the bound on the $\ell_p$
norm of the displacement vectors while maintaining linear separability.

If, however, the prototype sets are linearly inseparable
or the bounds on the displacement vectors are too large, not all sets
of observed patterns will be linearly separable. More generally,
let $f(x|C_1)$, $f(x|C_2)$ be probability densities corresponding to the
distributions of observed patterns in class $C_1$ and class $C_2$,
respectively. If these densities overlap on a region $\mathcal{R}$, where

$\mathcal{R} = \{x \in \mathbb{R}^n : f(x|C_1) > 0,\ f(x|C_2) > 0\}$

then there is no discriminant, linear or otherwise, that will always
correctly classify an unknown test pattern $x \in \mathcal{R}$.

The following Bayesian model is often employed for the stochastic
problem. Unknown test patterns are randomly presented from $C_1$ and
$C_2$ with given prior probabilities of occurrence $\pi_1$ and $\pi_2$,
respectively. Thus the test patterns have the mixture density

(4.1.1)  $f(x) = \pi_1 f(x|C_1) + \pi_2 f(x|C_2)$

Let $Pr(C_i|x)$ be the posterior probability that x belongs to $C_i$,
i = 1, 2. Then by the Bayes formula,

(4.1.2)  $Pr(C_i|x) = \pi_i f(x|C_i)/f(x)$,  i = 1, 2.

Define the loss matrix

$L = \begin{pmatrix} \lambda_{11} & \lambda_{12} \\ \lambda_{21} & \lambda_{22} \end{pmatrix}$

where $\lambda_{ij}$ is the loss incurred by deciding that an unknown test pattern
belongs to $C_i$ when its true class is $C_j$. The expected loss for the
decision "x belongs to $C_i$" is thus

(4.1.3)  $q_i(x) = \lambda_{ii} Pr(C_i|x) + \lambda_{ij} Pr(C_j|x)$,  $i \ne j$,  i = 1, 2

The decision rule that minimizes the expected loss is

(4.1.4)  decide  $x \in C_1$  if  $q_1(x) < q_2(x)$

                 $x \in C_2$  if  $q_1(x) > q_2(x)$.

This is called the Bayes decision rule. The equivalent Bayes discriminant
is

(4.1.5)  $g(x) = q_2(x) - q_1(x)$.
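The Bayes decision rule (4.1.2)-(4.1.5) can be written out directly. The sketch below is illustrative only; the function names and the numerical priors, likelihood values and loss matrices are assumptions chosen for the example.

```python
def posteriors(priors, likelihoods):
    """Bayes formula (4.1.2): Pr(C_i|x) = pi_i f(x|C_i) / f(x)."""
    mixture = sum(p * f for p, f in zip(priors, likelihoods))  # (4.1.1)
    return [p * f / mixture for p, f in zip(priors, likelihoods)]


def expected_losses(loss, post):
    """(4.1.3): q_i(x) = sum over j of loss[i][j] * Pr(C_j|x)."""
    return [sum(lij * pj for lij, pj in zip(row, post)) for row in loss]


def bayes_discriminant(priors, likelihoods, loss):
    """(4.1.5): q_2(x) - q_1(x); positive means 'decide C_1'."""
    q1, q2 = expected_losses(loss, posteriors(priors, likelihoods))
    return q2 - q1


priors = [0.5, 0.5]
likelihoods = [0.2, 0.6]          # assumed values of f(x|C_1), f(x|C_2)

zero_one_loss = [[0.0, 1.0], [1.0, 0.0]]
g_error_rate = bayes_discriminant(priors, likelihoods, zero_one_loss)

# Deciding C_2 when the true class is C_1 is made five times as costly.
asymmetric_loss = [[0.0, 1.0], [5.0, 0.0]]
g_asymmetric = bayes_discriminant(priors, likelihoods, asymmetric_loss)
```

With the posteriors 0.25 and 0.75 of this example, the error-rate loss gives a discriminant value of -0.5 (decide $C_2$), whereas the asymmetric loss gives +0.5 (decide $C_1$).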

Although the Bayes discriminant is optimal in the sense of
minimizing the expected loss, rarely is enough information available to
calculate it. The probability densities $f(x|C_1)$, $f(x|C_2)$ and the prior
probabilities $\pi_1$ and $\pi_2$ are usually unknown. The only data available
may be two given sets $\mathcal{A}_1$, $\mathcal{A}_2$ of known representatives of $C_1$ and
$C_2$, respectively. There are several approaches in this case. First, a
parametric form for each class density, such as multivariate normal,
may be assumed. The sample sets $\mathcal{A}_1$, $\mathcal{A}_2$ are used to estimate the
parameters and hence the density functions. The estimated density
functions are combined with estimates of the prior probabilities $\pi_1$,
$\pi_2$ to yield an estimate of the Bayes discriminant.

The formulational difficulty with this approach is that the
assumption that a class density belongs to a known parametric family
may be unwarranted. For example, in the template matching problem,
the class densities may be complex mixtures of simpler densities
centered around the prototypes. An alternative approach in this case
is the use of non-parametric density estimation techniques such as
Parzen window function estimators (Duda and Hart [5]). The drawback
with this technique is that it can produce very complicated density
function estimates that require storage of all given samples
$x \in \mathcal{A}_1 \cup \mathcal{A}_2$ for implementation.

The approach taken here is to assume a parametric functional
form, namely linear, for the discriminant function. The linear coefficients
are chosen so that the discriminant performs well, according to some
mathematical programming criterion, on given known sets of sample patterns.
If the sets of sample patterns are large and well representative of their
respective class populations, then the discriminant is expected to
perform well on these entire class populations. The performance
criterion used will be the error rate on the given sets of sample patterns.
This corresponds to the Bayesian loss matrix

$L = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$

The optimal Bayes discriminant is thus

(4.1.6)  $g_0(x) = Pr(C_1|x) - Pr(C_2|x)$

corresponding to the decision rule of assigning the pattern to the
class of greater posterior probability.

4.2.  Linear Discriminants by Mathematical Programming

Assume two sample sets $\mathcal{A}_1$, $\mathcal{A}_2$ of known representatives
of classes $C_1$ and $C_2$, respectively, are given. Let A
be the corresponding m x (n + 1) signed augmented pattern matrix.
Then linear discriminants of the form $g(x) = wx - \theta$ can be generated
as solutions to mathematical programs of the form

(4.2.1)  $\min_{u = (w, \theta) \in \mathbb{R}^{n+1}} f(u) = \sum_{i=1}^{m} f(u, a_i)$

where f(u, a) is a penalty function that reflects the performance of the
discriminant defined by u on the pattern corresponding to a. Ideally,
f(u, a) should have the following properties.

P1.  Errors should be penalized ($f(u, a) > 0$ if $a \cdot u < 0$) and correct
classifications rewarded ($f(u, a) \le 0$ if $a \cdot u > 0$).

P2.  The mathematical program (4.2.1) should be easily solvable by
existing algorithms.

P3.  If $\mathcal{A}_1$ and $\mathcal{A}_2$ are linearly separable, the solution to (4.2.1)
should determine a separating hyperplane.

These properties generally govern the choice of the function f(u, a)
in the models discussed below. However, in all these cases at least
one of the properties has been sacrificed to achieve the others.

4.3.  Minimum Error Rate Programs

If error rate is the dominant criterion for choosing a decision
rule, then the best linear discriminant that can be generated from the
sample sets $\mathcal{A}_1$ and $\mathcal{A}_2$ is the one that makes the fewest
misclassifications on these sets. This corresponds to the penalty function

$f(u, a) = 1$ if $a \cdot u < 0$,  $f(u, a) = 0$ if $a \cdot u > 0$

Thus the objective $f(u) = \sum_{i=1}^{m} f(u, a_i)$ in (4.2.1) is equal to the number
of errors made by u on $\mathcal{A}_1 \cup \mathcal{A}_2$.

Ibaraki and Muroga [8] have formulated this case as the mixed
integer linear program

         min   $e \cdot s$

(4.3.1)  s.t.  $Au + \beta I s \ge e$

               $u = (w, \theta) \in \mathbb{R}^{n+1}$

               $(s)_i = 0$ or 1,  $i = 1, \ldots, m$

where $\beta$ is a large positive number. If $\beta$ is sufficiently large,
then an optimal solution $(\hat{u}, \hat{s})$ to (4.3.1) satisfies

$(\hat{s})_i = 0$ iff $a_i \cdot \hat{u} \ge 1$

$(\hat{s})_i = 1$ iff $a_i \cdot \hat{u} < 0$

and $\hat{u}$ is thus a minimum error rate discriminant. Unfortunately, the
computational difficulty of solving (4.3.1) would become prohibitive
for large values of m. Thus this penalty function has properties P1
and P3 but lacks P2. Other choices of penalty function may yield a
discriminant with nearly as low an error rate on $\mathcal{A}_1 \cup \mathcal{A}_2$ with

far less computational effort.

4.4.  Least Squares Programs

The penalty function choice $f(u, a) = (1 - a \cdot u)^2$ results in the
program

(4.4.1)  $\min_{u = (w, \theta) \in \mathbb{R}^{n+1}} \|Au - e\|_2$

which is an example of the linear approximation problem with $\ell_2$ norm
discussed in Section (3.5). As noted there, the explicit solution to
(4.4.1) is

(4.4.2)  $\hat{u} = (A'A)^{-1} A'e$

where the existence of $(A'A)^{-1}$ is guaranteed by the assumption that A
is of full column rank (n + 1). Computationally, this is the easiest
of the models to solve. However, the model lacks Properties P1 and P3.
The function $(1 - a \cdot u)^2$ penalizes both incorrect ($a \cdot u < 0$) and correct
($a \cdot u > 0$) classifications. For correct classifications, the penalty
actually increases as $a \cdot u$ increases past the margin value of one. The
following simple example illustrates the absence of Property P3 due to
this unfortunate behavior.

Example (4.4.3).

Let $\mathcal{A}_1 = \{\alpha, 2, 1\}$, $\mathcal{A}_2 = \{-1\}$ be one-dimensional pattern sets
with $\alpha > 0$. Clearly $\mathcal{A}_1$ and $\mathcal{A}_2$ are linearly separable by the
discriminant g(x) = x for all positive values of $\alpha$. The signed
augmented pattern matrix is

$A = \begin{bmatrix} \alpha & -1 \\ 2 & -1 \\ 1 & -1 \\ 1 & 1 \end{bmatrix}$

A direct calculation using (4.4.2) shows that the least squares
discriminant, after normalization to make the coefficient of x equal to
unity, is given by

$g(x) = x + \dfrac{\alpha^2 - 6\alpha + 4}{2\alpha + 12}$

For all values $\alpha > 4 + 2\sqrt{6}$, g(-1) > 0 and hence the pattern in $\mathcal{A}_2$
is misclassified. The penalty that the least squares criterion places
on excessively large absolute values of the discriminant function for
both correct and incorrect classifications gives too much influence to
isolated patterns that are far from the main group.  □
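The calculation in Example (4.4.3) can be reproduced numerically. The sketch below is an illustration under one stated assumption about the row convention: a class $C_1$ pattern x contributes the row (x, -1) and a class $C_2$ pattern contributes (-x, 1), so that $a \cdot u = wx - \theta$ is positive exactly when the pattern is correctly classified. The function name is hypothetical, and the 2 x 2 normal equations are solved by Cramer's rule.

```python
import math


def least_squares_offset(alpha):
    """Solve (A'A)u = A'e and return -theta/w, so that g(x) = x + offset."""
    A = [[alpha, -1.0], [2.0, -1.0], [1.0, -1.0], [1.0, 1.0]]
    s11 = sum(r[0] * r[0] for r in A)   # entries of A'A
    s12 = sum(r[0] * r[1] for r in A)
    s22 = sum(r[1] * r[1] for r in A)
    t1 = sum(r[0] for r in A)           # entries of A'e
    t2 = sum(r[1] for r in A)
    det = s11 * s22 - s12 * s12
    w = (t1 * s22 - s12 * t2) / det
    theta = (s11 * t2 - s12 * t1) / det
    return -theta / w                   # w > 0 here, so signs are preserved


def closed_form_offset(alpha):
    """Offset appearing in the normalized discriminant of the example."""
    return (alpha ** 2 - 6.0 * alpha + 4.0) / (2.0 * alpha + 12.0)


threshold = 4.0 + 2.0 * math.sqrt(6.0)      # about 8.899
g_minus_one_at_8 = -1.0 + least_squares_offset(8.0)
g_minus_one_at_10 = -1.0 + least_squares_offset(10.0)
```

For $\alpha = 8$ the value g(-1) is negative, so the class $C_2$ pattern is still classified correctly, while for $\alpha = 10$ it is positive and the pattern is misclassified. This is consistent with the threshold $4 + 2\sqrt{6}$.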


Despite this drawback, the least squares discriminant has a
significant asymptotic property. Patterson and Womack [20] show that
if $\mathcal{A}_1$ and $\mathcal{A}_2$ are constructed from class $C_1$ and $C_2$ patterns,
respectively, by selecting m independent patterns of known
classification from the mixture distribution with density

$f(x) = \pi_1 f(x|C_1) + \pi_2 f(x|C_2)$,

then the discriminant defined by (4.4.2) asymptotically approaches the
minimum squared error approximation to the Bayes discriminant $g_0(x)$
as $m \to \infty$. This approximation minimizes

(4.4.4)  $\int [(wx - \theta) - g_0(x)]^2 f(x)\, dx$

However, as Duda and Hart [5] point out, the best linear approximation
to the Bayes discriminant does not necessarily have any favorable error
rate properties. Points where f(x) is large and points far from the
surface $g_0(x) = 0$ are emphasized at the expense of points near this
surface.

4.5.  Linear Discriminants by Least Positive Deviations

The penalty function

(4.5.1)  $f(u, a) = (1 - a \cdot u)^+$

leads to the LPD linear program first suggested in a pattern
classification context by Smith [21]:

         min   $e \cdot s$

(4.5.2)  s.t.  $Au + Is \ge e$

               $s \ge 0$

               $u = (w, \theta) \in \mathbb{R}^{n+1}$,  $s \in \mathbb{R}^m$

This model has the property P2 since it is relatively easy to solve by
the ALPD algorithm presented in Chapter 3 or by the simplex method with
upper bounds applied to the dual. In addition, property P3 is satisfied
since linearly separable problems are characterized by the feasibility
of the system $Au \ge e$, so (4.5.2) will produce a separating hyperplane
if one exists. The only property not satisfied is P1. A penalty is
incurred whenever $f(u, a) > 0$, which is equivalent to the event
$a \cdot u < 1$. This event will be called a margin violation. A margin violation
is a true misclassification only if $a \cdot u < 0$. Thus correct classifications
are penalized if $0 < a \cdot u < 1$.
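The three penalty functions of Sections 4.3 to 4.5 can be compared on a common set of values of $a \cdot u$. The sketch below is illustrative: the function names are hypothetical, and counting the boundary case $a \cdot u = 0$ as an error is an assumption made here, since that case is not covered by the error-count penalty as stated. The sample rows are the signed augmented patterns (x, -1) for $C_1$ patterns 10, 2, 1 and (-x, 1) for the $C_2$ pattern -1, evaluated for the separating discriminant g(x) = x, i.e. u = (1, 0).

```python
def error_count(t):
    """Minimum error rate penalty; t = a.u (t = 0 counted as an error)."""
    return 1.0 if t <= 0 else 0.0


def least_squares(t):
    """Least squares penalty (1 - a.u)^2."""
    return (1.0 - t) ** 2


def lpd(t):
    """Least positive deviations penalty (1 - a.u)+."""
    return max(1.0 - t, 0.0)


def total_penalty(penalty, A, u):
    """Objective of (4.2.1): sum of the penalty over all rows of A."""
    return sum(penalty(sum(aij * uj for aij, uj in zip(ai, u))) for ai in A)


A = [[10.0, -1.0], [2.0, -1.0], [1.0, -1.0], [1.0, 1.0]]
u = [1.0, 0.0]

totals = {
    "error count": total_penalty(error_count, A, u),
    "least squares": total_penalty(least_squares, A, u),
    "LPD": total_penalty(lpd, A, u),
}
```

The separating discriminant incurs no error-count or LPD penalty, but the least squares objective charges 81 + 1 = 82, almost all of it for the correctly classified pattern at x = 10. A value such as $a \cdot u = 0.5$ illustrates a margin violation: it is a correct classification, yet `lpd(0.5)` is 0.5.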

The dual of (4.5.2) is

         max   $e \cdot y$

(4.5.3)  s.t.  $A'y = 0$

               $0 \le y \le e$

               $y \in \mathbb{R}^m$

Let $\hat{y}$ be an optimal basic solution to (4.5.3) as determined by the
simplex method with upper bounds, and let $\hat{u}$ be the optimal primal
solution which is the simplex multiplier vector for the terminal optimal
basis in (4.5.3). Assuming non-degeneracy, the termination conditions
(3.1.8) for each non-basic variable $(\hat{y})_i$ are

(4.5.4)  $(\hat{y})_i = 0 \iff a_i \cdot \hat{u} > 1$

         $(\hat{y})_i = 1 \iff a_i \cdot \hat{u} < 1$

Thus the patterns which are margin violators are those for which the
corresponding optimal dual variables are at the upper bound.

Two distinct cases arise which are distinguished by the form of the optimal solution $\hat{u} = (\hat{w}, \hat{\theta})$:

Case 1. $\hat{w} = 0$. This case occurs when the corresponding optimal dual basis in (4.5.3) consists of signed augmented patterns which are all derived from a single sample set $\mathcal{S}_1$ or $\mathcal{S}_2$. The sample set which is the source of the optimal dual basis will be called dominant. If $\mathcal{S}_1$ is dominant, $\hat{\theta} = -1$; otherwise $\hat{\theta} = +1$. In general the dominant set is the larger of the two sample sets since the optimal objective value is equal to twice the number of patterns in the non-dominant set. The discriminant corresponding to $\hat{u}$ is the constant function $f(x) = -\hat{\theta}$, which is equivalent to the decision rule that classifies all patterns into the class of the dominant sample set. Since all of the inequalities corresponding to patterns in the dominant set are tight at $u = \hat{u}$, this solution is degenerate if there are more than $(n+1)$ such patterns.

This case may arise, for example, when one of the sample sets is overwhelmingly larger than the other and the sets are not linearly separable. In such circumstances the discriminant $f(x) = -\hat{\theta}$, although uninteresting, is not unreasonable. However, this case can be avoided if desired by appending additional constraints to (4.5.2), as seen in several models discussed below, or by solving a weighted problem where greater weight is assigned to the smaller set (see 4.5.8).

Case 2. $\hat{w} \ne 0$. This is the case of interest which occurs when the optimal dual basis consists of a mixture of signed augmented patterns derived from both pattern sets. The discriminant hyperplane can be characterized geometrically as follows. As shown in Figure (4.5.5), this hyperplane together with the parallel margin hyperplanes $\hat{w} \cdot x = \hat{\theta} - 1$ and $\hat{w} \cdot x = \hat{\theta} + 1$ divide the pattern space $\mathbb{R}^n$ into four regions defined by

$$
\begin{aligned}
\mathcal{R}_1 &= \{x : \hat{w} \cdot x \ge \hat{\theta} + 1\} \\
\mathcal{R}_2 &= \{x : \hat{\theta} < \hat{w} \cdot x < \hat{\theta} + 1\} \\
\mathcal{R}_3 &= \{x : \hat{\theta} - 1 < \hat{w} \cdot x < \hat{\theta}\} \\
\mathcal{R}_4 &= \{x : \hat{w} \cdot x \le \hat{\theta} - 1\}
\end{aligned}
\tag{4.5.6}
$$

Figure (4.5.5). LPD Discriminant and Margin Planes Divide the Pattern Space into Four Regions. Correctly Classified Margin Violators Are Marked with a (') and True Misclassifications with a (").


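Assuming non-degeneracy, exactly $(n+1)$ of the inequalities $Au \ge e$ are tight at $u = \hat{u}$. Thus the margin plane $\hat{w} \cdot x = \hat{\theta} + 1$ passes through $k$ patterns from $\mathcal{S}_1$ and the margin plane $\hat{w} \cdot x = \hat{\theta} - 1$ passes through $n + 1 - k$ patterns from $\mathcal{S}_2$, where $1 \le k \le n$. The margin violators are the patterns in $\mathcal{S}_1$ that lie in $\mathcal{R}_2 \cup \mathcal{R}_3 \cup \mathcal{R}_4$ (that is, $\hat{w} \cdot x < \hat{\theta} + 1$) and the patterns in $\mathcal{S}_2$ that lie in $\mathcal{R}_1 \cup \mathcal{R}_2 \cup \mathcal{R}_3$ (that is, $\hat{w} \cdot x > \hat{\theta} - 1$). The true misclassifications are the patterns in $\mathcal{S}_1$ that lie in $\mathcal{R}_3 \cup \mathcal{R}_4$ and the patterns in $\mathcal{S}_2$ that lie in $\mathcal{R}_1 \cup \mathcal{R}_2$. It is shown below that if the underlying pattern classes $C_1$ and $C_2$ are bounded and the sample sets $\mathcal{S}_1$ and $\mathcal{S}_2$ are large, then there are approximately equal numbers of margin violators from $\mathcal{S}_1$ and $\mathcal{S}_2$, and the centers of gravity (means) of the margin violators in each set are approximately equal.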
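
The region bookkeeping above can be sketched as follows. The labeling is an assumption made for this illustration: the four regions are numbered 1 (on or beyond the upper margin plane) through 4 (on or beyond the lower margin plane), a point exactly on the discriminant hyperplane is reported as region 0, and the function names are hypothetical.

```python
def region(w, theta, x):
    """Return the region index (1..4) of pattern x, or 0 on the discriminant."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    if s >= theta + 1:
        return 1          # on or beyond the upper margin plane
    if s > theta:
        return 2          # between the discriminant and the upper margin plane
    if s == theta:
        return 0          # exactly on the discriminant hyperplane
    if s > theta - 1:
        return 3          # between the lower margin plane and the discriminant
    return 4              # on or beyond the lower margin plane


def is_margin_violator(cls, r):
    """Class 1 patterns should lie in region 1, class 2 patterns in region 4."""
    return r != 1 if cls == 1 else r != 4


def is_misclassified(cls, r):
    """True misclassifications lie strictly on the wrong side of the discriminant."""
    return r in (3, 4) if cls == 1 else r in (1, 2)


w, theta = [1.0], 0.0  # hypothetical one-dimensional discriminant
for x in ([1.5], [0.5], [-0.5], [-2.0]):
    r = region(w, theta, x)
    print(x, r, is_margin_violator(1, r), is_misclassified(1, r))
```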

The following example illustrates these concepts.

Example (4.5.7).

Let $\mathcal{S}_1 = \{1, 2, \ldots, k, -(k+1)\}$, $\mathcal{S}_2 = \{-1, -2, \ldots, -k, (k+1)\}$ be one-dimensional pattern sets with $k \ge 2$. This is the same problem as that discussed in Example (2.4.6) where the poor performance of the maximum quality hyperplane was revealed. Here the LPD discriminant will be shown to be $g(x) = \alpha x$ where $\alpha > 0$. This discriminant correctly classifies all patterns in $\mathcal{S}_1 \cup \mathcal{S}_2$ except the outliers $-(k+1)$ in $\mathcal{S}_1$ and $(k+1)$ in $\mathcal{S}_2$.

The signed augmented pattern matrix is

$$
A = \begin{pmatrix}
1 & -1 \\
\vdots & \vdots \\
k & -1 \\
-(k+1) & -1 \\
1 & 1 \\
\vdots & \vdots \\
k & 1 \\
-(k+1) & 1
\end{pmatrix}
$$

For $i = 1, 2, \ldots, k$, the two signed augmented patterns $(i,-1)$ and $(i,1)$ derived from $\mathcal{S}_1$ and $\mathcal{S}_2$, respectively, define the basic inequality solution $u_i = (1/i, 0)$. As seen in Section (3.3), $u_i$ defines an optimal solution to (4.5.2) if

$$
\frac{d}{d\tau} f(u_i + \tau d_j)\Big|_{\tau = 0^{+}} \ge 0, \qquad
\frac{d}{d\tau} f(u_i + \tau d_j)\Big|_{\tau = 0^{-}} \le 0, \qquad j = 1, 2,
$$

where the directions

$$
d_1 = \begin{pmatrix} 1/(2i) \\ -1/2 \end{pmatrix}
\quad \text{and} \quad
d_2 = \begin{pmatrix} 1/(2i) \\ 1/2 \end{pmatrix}
$$

are the first and second columns of the matrix

$$
\begin{pmatrix} i & -1 \\ i & 1 \end{pmatrix}^{-1}
$$

and $f$ is the LPD objective function defined by $f(u) = e \cdot (e - Au)^{+}$. A direct calculation from formulas (3.2.8) yields

$$
\frac{d}{d\tau} f(u_i + \tau d_1)\Big|_{\tau = 0^{+}}
= \frac{d}{d\tau} f(u_i + \tau d_2)\Big|_{\tau = 0^{+}}
= \frac{1}{i}\left[(k+1) - \frac{i(i-1)}{2}\right]
$$

and

$$
\frac{d}{d\tau} f(u_i + \tau d_1)\Big|_{\tau = 0^{-}}
= \frac{d}{d\tau} f(u_i + \tau d_2)\Big|_{\tau = 0^{-}}
= -\frac{1}{i}\left[\frac{i(i+1)}{2} - (k+1)\right].
$$

Let $i^*$ be the smallest positive integer such that

$$ \frac{i(i+1)}{2} \ge k + 1 . $$

For $k \ge 2$, it is easily verified that $1 \le i^* \le k$ and hence $u_{i^*}$ is optimal for (4.5.2). The corresponding discriminant is $g(x) = (1/i^*)\,x$. The margin violators for this discriminant consist of all patterns such that $|x| < i^*$ and the two outliers, while the only true misclassifications are those two outliers. □
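
The conclusion of Example (4.5.7) can be checked numerically with a brute-force sketch. The helper names are hypothetical, and the grid search is only a sanity check on the stated optimum $u_{i^*} = (1/i^*, 0)$, not the ALPD algorithm or the simplex method.

```python
def signed_augmented_rows(k):
    """Rows (x, -1) for the first set and -(x, -1) for the second set."""
    set1 = list(range(1, k + 1)) + [-(k + 1)]
    set2 = [-x for x in set1]
    rows = [(float(x), -1.0) for x in set1]
    rows += [(float(-x), 1.0) for x in set2]
    return rows


def lpd_objective(rows, u):
    """LPD objective: sum of the positive parts of 1 - a.u over all rows."""
    return sum(max(0.0, 1.0 - (a0 * u[0] + a1 * u[1])) for a0, a1 in rows)


def i_star(k):
    """Smallest positive integer i with i(i+1)/2 >= k+1."""
    i = 1
    while i * (i + 1) / 2 < k + 1:
        i += 1
    return i


k = 3
rows = signed_augmented_rows(k)
best = lpd_objective(rows, (1.0 / i_star(k), 0.0))
# Compare against a coarse grid of alternative (w, theta) values.
grid_min = min(
    lpd_objective(rows, (0.05 * p, 0.05 * q))
    for p in range(-40, 41)
    for q in range(-40, 41)
)
print(i_star(k), best, grid_min >= best - 1e-9)
```

For small $k$ the optimum need not be unique (for $k = 2$ the constant discriminant attains the same objective value), so the check compares objective values rather than solutions.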


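If misclassifications of patterns from one class are considered more serious than misclassifications from the other class, it may be desirable to adjust the penalty functions accordingly. The LPD program (4.5.2) can be generalized to the weighted LPD model

$$
\begin{aligned}
\min\ & \alpha_1\, e \cdot s_1 + \alpha_2\, e \cdot s_2 \\
\text{s.t.}\ & A_1 u + I s_1 \ge e \\
& A_2 u + I s_2 \ge e \\
& s_1 \ge 0, \quad s_2 \ge 0 \\
& u = (w,\theta) \in \mathbb{R}^{n+1}, \quad (s_1, s_2) \in \mathbb{R}^{m}
\end{aligned}
\tag{4.5.8}
$$

where $A_1$ and $A_2$ are the signed augmented pattern matrices for $\mathcal{S}_1$ and $\mathcal{S}_2$, respectively, and $\alpha_1$ and $\alpha_2$ are scalar weighting factors reflecting the relative penalty on each type of error. The dual of (4.5.8) is

$$
\begin{aligned}
\max\ & e \cdot y_1 + e \cdot y_2 \\
\text{s.t.}\ & A_1' y_1 + A_2' y_2 = 0 \\
& 0 \le y_1 \le \alpha_1 e, \quad 0 \le y_2 \le \alpha_2 e \\
& y = (y_1, y_2) \in \mathbb{R}^{m}
\end{aligned}
\tag{4.5.9}
$$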

Let $\hat{y} = (\hat{y}_1, \hat{y}_2)$ be an optimal solution to (4.5.9) as determined by the simplex method with upper bounds and let $\hat{u}$ be the corresponding simplex multiplier vector that defines the weighted LPD solution to (4.5.8). Define the following index sets:

$B_j = \{i : (\hat{y})_i \text{ is basic and } x_i \in \mathcal{S}_j\}, \quad j = 1, 2$

$U_j = \{i : (\hat{y})_i \text{ is non-basic, equal to its upper bound of } \alpha_j, \text{ and } x_i \in \mathcal{S}_j\}, \quad j = 1, 2$

$L = \{i : (\hat{y})_i \text{ is non-basic and equal to its lower bound of zero}\}$

The termination criteria for the simplex method with upper bounds imply that $U_1$ and $U_2$ are the index sets of the margin violators from $\mathcal{S}_1$ and $\mathcal{S}_2$, respectively. Let $m_1$ and $m_2$ be the respective numbers of elements in $U_1$ and $U_2$, and let

$$ \bar{x}_j = \frac{1}{m_j} \sum_{i \in U_j} x_i, \qquad j = 1, 2, $$

be the mean of the margin violators from $\mathcal{S}_j$. The following proposition will be used to show that for large values of $m_1$ and $m_2$, the ratio $m_1/m_2$ of the numbers of margin violators from $\mathcal{S}_1$ to margin violators from $\mathcal{S}_2$ is approximately equal to the inverse penalty ratio $\alpha_2/\alpha_1$. Furthermore, if both pattern classes $C_1$ and $C_2$ are bounded, then the means $\bar{x}_1$ and $\bar{x}_2$ are approximately equal.


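PROPOSITION (4.5.10). Let $\hat{\Lambda}$ be the optimal objective value corresponding to $\hat{u}$ in (4.5.8) and let $\gamma = \max_{i=1,\ldots,m} \|x_i\|$ for any vector norm $\|\cdot\|$. Then

a) $\hat{\Lambda} - (n+1)\max(\alpha_1,\alpha_2) \le m_1\alpha_1 + m_2\alpha_2 \le \hat{\Lambda}$

b) $|m_1\alpha_1 - m_2\alpha_2| \le (n+1)\max(\alpha_1,\alpha_2)$

c) $\|\bar{x}_1 - \bar{x}_2\| \le 2(n+1)\,\gamma\,[\max(\alpha_1,\alpha_2)/\max(\alpha_1 m_1, \alpha_2 m_2)]$

Proof. Equality of the optimal primal and dual objective values implies

$$
\hat{\Lambda} = \sum_{i \in B_1 \cup B_2} (\hat{y})_i + \sum_{i \in U_1 \cup U_2} (\hat{y})_i + \sum_{i \in L} (\hat{y})_i .
$$

By the termination criteria of the simplex method with upper bounds, the second term is equal to $\alpha_1 m_1 + \alpha_2 m_2$ and the third term vanishes. Part a) then follows immediately from the bounds on the $(n+1)$ basic variables in the optimal dual solution. Substitution of the known values of the non-basic optimal dual variables in the constraint set $A_1' y_1 + A_2' y_2 = 0$ yields

$$
\alpha_1 \sum_{i \in U_1} \begin{pmatrix} x_i \\ -1 \end{pmatrix}
- \alpha_2 \sum_{i \in U_2} \begin{pmatrix} x_i \\ -1 \end{pmatrix}
= \sum_{i \in B_2} \begin{pmatrix} x_i \\ -1 \end{pmatrix} (\hat{y})_i
- \sum_{i \in B_1} \begin{pmatrix} x_i \\ -1 \end{pmatrix} (\hat{y})_i .
\tag{4.5.11}
$$

The last of the $(n+1)$ equations (4.5.11) implies

$$
|m_1\alpha_1 - m_2\alpha_2|
= \Big| \sum_{i \in B_1} (\hat{y})_i - \sum_{i \in B_2} (\hat{y})_i \Big|
\le \sum_{i \in B_1 \cup B_2} (\hat{y})_i .
$$

Part b) then follows from the bounds on the $(n+1)$ elements in $B_1 \cup B_2$. Application of the triangle inequality to the first $n$ equations in (4.5.11) yields

$$
\|\alpha_1 m_1 \bar{x}_1 - \alpha_2 m_2 \bar{x}_2\| \le (n+1)\,\gamma \max(\alpha_1,\alpha_2) .
\tag{4.5.12}
$$

But

$$
\|\bar{x}_1 - \bar{x}_2\|
= \Big\| \bar{x}_1 - \frac{\alpha_2 m_2}{\alpha_1 m_1}\bar{x}_2 + \frac{\alpha_2 m_2}{\alpha_1 m_1}\bar{x}_2 - \bar{x}_2 \Big\|
\le \Big\| \bar{x}_1 - \frac{\alpha_2 m_2}{\alpha_1 m_1}\bar{x}_2 \Big\|
+ \frac{|\alpha_2 m_2 - \alpha_1 m_1|}{\alpha_1 m_1}\,\|\bar{x}_2\| .
$$

By (4.5.12),

$$
\Big\| \bar{x}_1 - \frac{\alpha_2 m_2}{\alpha_1 m_1}\bar{x}_2 \Big\|
\le \frac{(n+1)\,\gamma \max(\alpha_1,\alpha_2)}{\alpha_1 m_1}
$$

and by part b), together with $\|\bar{x}_2\| \le \gamma$,

$$
\frac{|\alpha_2 m_2 - \alpha_1 m_1|}{\alpha_1 m_1}\,\|\bar{x}_2\|
\le \frac{(n+1)\,\gamma \max(\alpha_1,\alpha_2)}{\alpha_1 m_1} .
$$

Thus

$$
\|\bar{x}_1 - \bar{x}_2\| \le \frac{2(n+1)\,\gamma \max(\alpha_1,\alpha_2)}{\alpha_1 m_1} .
\tag{4.5.13}
$$

Part c) then follows from (4.5.13) and the symmetrical relation obtained by reversing the roles of $\mathcal{S}_1$ and $\mathcal{S}_2$. □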


COROLLARY (4.5.14). If the underlying pattern classes $C_1$ and $C_2$ are bounded, then

$$
\lim_{m_1, m_2 \to \infty}
\begin{pmatrix} \dfrac{\alpha_1 m_1}{\alpha_2 m_2} \\[2ex] \|\bar{x}_1 - \bar{x}_2\| \end{pmatrix}
= \begin{pmatrix} 1 \\ 0 \end{pmatrix} .
$$

Several additional models have been proposed that eliminate the use of a margin vector and hence the distinction between margin violators and true misclassifications. In general these are constrained weighted LPD problems of the form

$$
\begin{aligned}
\min\ & \mu \cdot s_1 + \nu \cdot s_2 \\
\text{s.t.}\ & A_1 u + I s_1 \ge 0 \\
& A_2 u + I s_2 \ge 0 \\
& Gu \ge b \\
& s_1 \ge 0, \quad s_2 \ge 0 \\
& u = (w,\theta) \in \mathbb{R}^{n+1}, \quad (s_1, s_2) \in \mathbb{R}^{m}
\end{aligned}
\tag{4.5.15}
$$

where $\mu$ and $\nu$ are strictly positive weight vectors and $Gu \ge b$ is a set of added constraints that eliminate the useless solution $u = 0$ and the uninteresting solution $w = 0$, $\theta = \pm 1$ that occurs when one of the pattern sets is dominant. Grinold [6] suggests a single added constraint of the form

$$ g^* \cdot u \ge 1 $$

where $g^* = e \cdot A/m$, i.e. the mean of all the signed augmented patterns. The program (4.5.15) will be feasible as long as $g^* \ne 0$. (The case $g^* = 0$ occurs only when the numbers of sample patterns from each class are equal and the sample means are equal.) Another possibility is the pair of constraints

$$
\begin{aligned}
g_1^* \cdot u &\ge 1 \\
g_2^* \cdot u &\ge 1
\end{aligned}
$$

where $g_j^*$, $j = 1, 2$, is the mean of the signed augmented sample patterns from class $j$. For the sake of feasibility it is required that $g_1^* \ne -g_2^*$, or equivalently that the sample means of the two classes differ. The linear discriminant produced by (4.5.15) then separates the two class sample means.
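
The vector $g^*$ in the single added constraint is simply the column mean of the signed augmented pattern matrix. The following sketch, with a hypothetical helper name and made-up sample points, computes it and shows the degenerate case in which both classes have the same size and the same sample mean. It assumes the signed augmented pattern is $(x,-1)$ for class 1 and $-(x,-1)$ for class 2.

```python
def mean_signed_augmented(set1, set2):
    """Mean of all signed augmented patterns, i.e. g* = e.A/m."""
    rows = [[float(v) for v in x] + [-1.0] for x in set1]
    rows += [[-float(v) for v in x] + [1.0] for x in set2]
    m = len(rows)
    return [sum(col) / m for col in zip(*rows)]


# Unequal class sizes: g* is nonzero, so the added constraint can be satisfied.
print(mean_signed_augmented([(1.0,)], [(-1.0,), (-3.0,)]))

# Equal class sizes and equal sample means (2, 1): g* vanishes.
print(mean_signed_augmented([(1.0, 0.0), (3.0, 2.0)], [(2.0, 0.0), (2.0, 2.0)]))
```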


4.6.  A Numerical  Experiment 

In  order  to  compare  the  behavior  of  the  ALPD  and  simplex 
algorithms  under  various  problem  conditions,  the  following  numerical 
experiment  was  devised. 

Two n-dimensional pattern sets $\mathcal{S}_1$ and $\mathcal{S}_2$ were constructed, each containing $m/2$ patterns. Each pattern $x$ in $\mathcal{S}_1$ was generated by the formula

$$ (x)_i = u[-1/2, 1/2] - \Delta, \qquad i = 1, \ldots, n $$

where $u[-1/2, 1/2]$ is a pseudorandom number uniformly distributed in the interval $[-1/2, 1/2]$. Thus the patterns in $\mathcal{S}_1$ are pseudorandom vectors uniformly distributed in the interior of the unit n-dimensional hypercube $H_1$ centered at $-(\Delta, \ldots, \Delta)$. Similarly, the patterns in $\mathcal{S}_2$ were generated by the formula

$$ (x)_i = u[-1/2, 1/2] + \Delta, \qquad i = 1, \ldots, n $$

These pseudorandom vectors are uniformly distributed in the interior of the unit n-dimensional hypercube $H_2$ centered at $(\Delta, \ldots, \Delta)$.

The situation for $n = 2$ is illustrated in Figure (4.6.1).

Figure (4.6.1). Overlapping Hypercube Problem for n = 2 With Bayes Discriminant $g_0(x) = -x_1 - x_2$.

For values of the scalar parameter $\Delta$ in the interval $[0, 1/2]$, the unit hypercubes $H_1$ and $H_2$ overlap on a cubical region of volume $V(\Delta)$, where

$$ V(\Delta) = (1 - 2\Delta)^n . $$

For $\Delta > 1/2$, $V(\Delta) = 0$ since $H_1$ and $H_2$ are disjoint. Thus any desired fractional overlap $\alpha$ is achieved by the setting

$$ \Delta = \frac{1 - \alpha^{1/n}}{2} . $$


This problem is intended to simulate a stochastic pattern classification problem with mixture density

$$ f(x) = \tfrac{1}{2} f(x \mid C_1) + \tfrac{1}{2} f(x \mid C_2) $$

where

$$
f(x \mid C_j) = \begin{cases} 1 & \text{if } x \in H_j \\ 0 & \text{otherwise} \end{cases}
$$

A Bayes discriminant for the lowest error rate criterion is easily verified to be $g_0(x) = -e \cdot x$. As seen in the two-dimensional case illustrated in Figure (4.6.1), the discriminant plane $g_0(x) = 0$ separates all of classes $C_1 = \{x \in H_1\}$ and $C_2 = \{x \in H_2\}$ outside the region of overlap and passes through the center of this region, misclassifying exactly half of each class there for an overall error rate of $\alpha/2$.

A series of pattern set pairs $\mathcal{S}_1$, $\mathcal{S}_2$ was generated for various values of the total number of patterns $m$, the pattern dimensionality $n$, and the fractional overlap $\alpha$. All combinations of the parameter values

$$
\begin{aligned}
m &= 100,\ 200,\ 500,\ 1000 \\
n &= 1,\ 2,\ 5,\ 10 \\
\alpha &= 0,\ .2,\ .4,\ .6,\ .8,\ 1.0
\end{aligned}
\tag{4.6.2}
$$

were used for a total of 96 cases. Usually five independent test problems were run for each case, although only two and in some cases one problem were run for some of the larger values of $m$ and $n$. Altogether a total of 377 independent problems were solved.

For each case, the signed augmented pattern matrix $A$ was constructed. The LPD problem (4.5.2) with tableau $[A \mid e]$ was solved with the ALPD algorithm, while the dual (4.5.3) was solved by the simplex method with upper bounds (SMUB). Since the two algorithms use identical pivot operations for basis changes but differ only in the pivot selection rules, the total number of pivots (basis changes) required to reach an optimum solution from the same initial, arbitrarily chosen basis serves as a convenient basis of comparison. (Thus the ALPD initialization algorithm given in Section (3.4) was not used. Rather the $(n+1)$ members of the initial basis were chosen as the signed augmented patterns corresponding to the first $n$ patterns in $\mathcal{S}_1$ and the first pattern in $\mathcal{S}_2$.)

In all cases except those for which $\alpha = 0$, the same optimal basis was achieved by both algorithms. When $\alpha = 0$ the pattern sets are linearly separable and several distinct optimal bases may exist. Frequently the algorithms arrived at different optimal solutions in this case, although of course each optimal solution defined a separating hyperplane. In general, the error rate achieved on $\mathcal{S}_1 \cup \mathcal{S}_2$ by the discriminant corresponding to the optimal solution was usually very close to the Bayes error rate of $\alpha/2$ with small fluctuations about this rate due to the finite size of the pattern sets.
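
The construction of the test problems can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the original experiment code: the function names are hypothetical, Python's `random` module stands in for the pseudorandom generator, and the usage example restricts itself to $n = 2$, the case illustrated in Figure (4.6.1).

```python
import random


def offset_for_overlap(alpha, n):
    """Offset that makes the two unit hypercubes overlap on a volume alpha."""
    return 0.5 * (1.0 - alpha ** (1.0 / n))


def make_pattern_sets(m, n, alpha, rng):
    """Draw m/2 patterns uniformly from each of the two shifted unit hypercubes."""
    delta = offset_for_overlap(alpha, n)
    half = m // 2
    set1 = [[rng.uniform(-0.5, 0.5) - delta for _ in range(n)] for _ in range(half)]
    set2 = [[rng.uniform(-0.5, 0.5) + delta for _ in range(n)] for _ in range(half)]
    return set1, set2


def error_rate_of_g0(set1, set2):
    """Empirical error rate of g0(x) = -e.x, with class 1 assigned when g0(x) > 0."""
    errors = sum(1 for x in set1 if -sum(x) <= 0)
    errors += sum(1 for x in set2 if -sum(x) > 0)
    return errors / (len(set1) + len(set2))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
s1, s2 = make_pattern_sets(20000, 2, 0.4, rng)
print(offset_for_overlap(0.4, 2), error_rate_of_g0(s1, s2))
```

With a large sample the printed rate should fluctuate around $\alpha/2 = 0.2$ for this two-dimensional case.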

Average values of the numbers of pivots required by the ALPD and SMUB algorithms are listed in Table (4.6.3) for each case. Some graphical representation of this data is provided by Figures (4.6.4) through (4.6.13), which reveal two clear trends.

First, as seen in Figures (4.6.4) through (4.6.7), the SMUB algorithm is highly sensitive to the fractional overlap $\alpha$ while the ALPD algorithm is not. For $\alpha = 0$ the numbers of required SMUB and ALPD pivots are nearly equal. As $\alpha$, and hence the degree of infeasibility of the system $Au \ge e$, increases, the number of SMUB pivots increases very quickly and then levels off while the number of ALPD pivots remains relatively constant. For several cases with large values of $\alpha$, the relative advantage of the ALPD algorithm in terms of number of pivots reaches a factor of several hundred. For a given value of $\alpha$, this factor seems to be an increasing function of the aspect ratio $m/(n+1)$ of the matrix $A$.

Second, as seen in Figures (4.6.8) through (4.6.13), for fixed values of $\alpha$ and $n$ the number of pivots appears to be a linearly increasing function of $m$. However, except for the $\alpha = 0$ case, the rate of increase is much higher for the SMUB than for the ALPD algorithm.

The computational advantage of the ALPD algorithm thus appears most significant for problems in which the matrix $A$ has a high aspect ratio and the underlying inequality system has a large degree of infeasibility. Such problems arise not only in linearly inseparable pattern classification models with large pattern sets but also in the linear approximation problem (3.5.6) with $\ell_1$ norm. For such problems the number of data points usually greatly exceeds the number of parameters to be fit, thus creating the high aspect ratio situation. The large degree of infeasibility in the underlying inequality system arises naturally since it is comprised of the two systems $Ax \ge b$ and $-Ax \ge -b$.

Table (4.6.3). Average Number of Pivots Required by Accelerated Least Positive Deviations (ALPD) and Simplex Method with Upper Bounds (SMUB) Algorithms.

Figures (4.6.8) through (4.6.13). Average Number of SMUB Pivots and Average Number of ALPD Pivots versus Number of Patterns (m), for Overlap = 0%, 40%, and 100%.


CHAPTER  5 


PIECEWISE  LINEAR  DISCRIMINANTS 

5.1. Piecewise Linear Discriminants

A direct generalization of the linear discriminant is the piecewise linear discriminant. Piecewise linear functions $f : \mathbb{R}^n \to \mathbb{R}^1$ can be defined recursively as follows (Chang [22]):

Definition (5.1.1). Piecewise Linear Function

1. Any linear function $f(x) = w \cdot x - \theta$ is piecewise linear.

2. If $f_1(x)$, $f_2(x)$ are piecewise linear, then so are $f(x) = \max\{f_1(x), f_2(x)\}$ and $g(x) = \min\{f_1(x), f_2(x)\}$.

3. No other functions are piecewise linear.

Piecewise linear functions of arbitrary complexity can be constructed by repeated use of rule 2 in (5.1.1). With "$\vee$" and "$\wedge$" representing the maximum and minimum operators respectively, the following identities are useful for manipulating expressions involving piecewise linear functions.

$$
\begin{aligned}
\text{a)}\quad & f_1 \wedge (f_2 \vee f_3) = (f_1 \wedge f_2) \vee (f_1 \wedge f_3) \\
\text{b)}\quad & f_1 \vee (f_2 \wedge f_3) = (f_1 \vee f_2) \wedge (f_1 \vee f_3) \\
\text{c)}\quad & -(f_1 \vee f_2) = -f_1 \wedge -f_2 \\
\text{d)}\quad & -(f_1 \wedge f_2) = -f_1 \vee -f_2
\end{aligned}
\tag{5.1.2}
$$


By repeated use of the distributive property a) and the associativity and commutativity of the minimum and maximum operators, any piecewise linear function $f$ can be written in disjunctive normal form

$$ f = \bigvee_{i=1}^{m} \Big( \bigwedge_{j=1}^{n_i} f_{ij} \Big), \tag{5.1.3} $$

where each $f_{ij}$ is linear.

This representation has the following geometrical interpretation. Let $\mathcal{R} = \{x : f(x) > 0\}$. Then $\mathcal{R} = \bigcup_{i=1}^{m} \mathcal{R}_i$, where $\mathcal{R}_i$ is the polyhedral convex set defined by the linear inequality system

$$
\begin{aligned}
f_{i1}(x) &> 0 \\
&\ \ \vdots \\
f_{i n_i}(x) &> 0
\end{aligned}
\tag{5.1.4}
$$

Thus each concave function $f_i = \bigwedge_{j=1}^{n_i} f_{ij}$ in (5.1.3) isolates a convex region $\mathcal{R}_i$ whose boundaries are defined by the hyperplanes $f_{i1}(x) = 0, \ldots, f_{i n_i}(x) = 0$. In a two-class pattern classification problem with pattern sets $\mathcal{S}_1$ and $\mathcal{S}_2$, if each region $\mathcal{R}_i$ contains patterns only from $\mathcal{S}_1$ and together the regions contain all of the patterns in $\mathcal{S}_1$, then $f = \bigvee_{i=1}^{m} f_i$ is a piecewise linear discriminant that separates $\mathcal{S}_1$ from $\mathcal{S}_2$.

Figure (5.1.5). The Piecewise Linear Function $(f_{11} \wedge f_{12}) \vee (f_{21} \wedge f_{22}) \vee (f_{31} \wedge f_{32})$ Separates the Two Pattern Classes.

The disjunctive normal form representation (5.1.3) can be used to show that piecewise linearity is preserved under the operations of addition and scalar multiplication.

PROPOSITION (5.1.6). If $f_1$ and $f_2$ are piecewise linear functions then $\alpha f_1 + \beta f_2$ is piecewise linear for all real constants $\alpha$, $\beta$.

Proof. It is sufficient to show that $\alpha f_1$ and $f_1 + f_2$ are piecewise linear. Let

$$
f_1 = \bigvee_{i=1}^{m} \Big( \bigwedge_{j=1}^{n_i} \ell_{ij} \Big), \qquad
f_2 = \bigvee_{p=1}^{r} \Big( \bigwedge_{q=1}^{s_p} m_{pq} \Big)
$$

be disjunctive normal form representations.

Scalar multiplication:

If $\alpha \ge 0$,

$$ \alpha f_1 = \bigvee_{i=1}^{m} \Big( \bigwedge_{j=1}^{n_i} \alpha \ell_{ij} \Big) . $$

If $\alpha < 0$,

$$ \alpha f_1 = -(|\alpha| f_1) = \bigwedge_{i=1}^{m} \Big( \bigvee_{j=1}^{n_i} \alpha \ell_{ij} \Big) . $$

Addition:

$$
f_1 + f_2 = \bigvee_{i=1}^{m} \Big( \bigwedge_{j=1}^{n_i} \ell_{ij} \Big) + f_2
= \bigvee_{i=1}^{m} \Big( \bigwedge_{j=1}^{n_i} (\ell_{ij} + f_2) \Big) .
$$

But $\ell_{ij} + f_2 = \bigvee_{p=1}^{r} \Big( \bigwedge_{q=1}^{s_p} (\ell_{ij} + m_{pq}) \Big)$, which is piecewise linear.

□
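
The closure argument can be made concrete with a small sketch that stores a piecewise linear function in disjunctive normal form and builds the normal forms of a sum and of a scalar multiple. The data layout (a list of conjunctions, each a list of `(w, theta)` pairs standing for $w \cdot x - \theta$) and all function names are assumptions made for this illustration. For the sum it uses the fact that the maximum of sums over index pairs equals the sum of the maxima, and likewise for minima; for a negative scalar it distributes the resulting minimum of maxima by choosing one linear piece from each conjunction.

```python
from itertools import product


def linear(piece, x):
    """Evaluate the linear piece (w, theta) as w.x - theta."""
    w, theta = piece
    return sum(wi * xi for wi, xi in zip(w, x)) - theta


def evaluate(dnf, x):
    """Evaluate a max-of-mins (disjunctive normal form) function at x."""
    return max(min(linear(p, x) for p in conj) for conj in dnf)


def scale_piece(piece, alpha):
    w, theta = piece
    return ([alpha * wi for wi in w], alpha * theta)


def add_pieces(p, q):
    return ([a + b for a, b in zip(p[0], q[0])], p[1] + q[1])


def add(f1, f2):
    """DNF of f1 + f2: one conjunction per pair of conjunctions."""
    return [[add_pieces(p, q) for p in c1 for q in c2] for c1 in f1 for c2 in f2]


def scale(f, alpha):
    """DNF of alpha * f for any real alpha."""
    if alpha >= 0:
        return [[scale_piece(p, alpha) for p in conj] for conj in f]
    # alpha < 0 turns max-of-mins into min-of-maxes; distributing gives one
    # conjunction for every way of picking a single piece from each conjunction.
    return [[scale_piece(p, alpha) for p in choice] for choice in product(*f)]


# Hypothetical one-dimensional examples.
f1 = [[([1.0], 0.0), ([-1.0], -2.0)], [([1.0], 3.0)]]  # max(min(x, 2 - x), x - 3)
f2 = [[([2.0], 1.0)], [([0.0], 0.0), ([1.0], 0.0)]]    # max(2x - 1, min(0, x))
x = [0.5]
print(evaluate(add(f1, f2), x), evaluate(f1, x) + evaluate(f2, x))
print(evaluate(scale(f1, -2.5), x), -2.5 * evaluate(f1, x))
```

The number of conjunctions can grow quickly under these operations, so the sketch is meant to illustrate the proof rather than to serve as an efficient representation.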


5.2.  Some  Examples. 

Many pattern classification schemes implicitly use piecewise linear discriminants. An example is the minimum distance classifier, also known as the "nearest neighbor" rule.

Let $\mathcal{P} = \{p_1, \ldots, p_j\}$, $\mathcal{Q} = \{q_1, \ldots, q_k\}$ be sets of prototype patterns representing classes $C_1$ and $C_2$, respectively. Let

$$
\begin{aligned}
d_1(x) &= \min_{i=1,\ldots,j} \{\|x - p_i\|_2\} \\
d_2(x) &= \min_{i=1,\ldots,k} \{\|x - q_i\|_2\} .
\end{aligned}
\tag{5.2.1}
$$

A minimum distance classifier is defined to be a classification procedure that implements the discriminant function

$$ f(x) = d_2^2(x) - d_1^2(x) . \tag{5.2.2} $$

Thus a pattern $x$ is classified into the class of the nearest prototype pattern as measured by Euclidean distance. This discriminant is piecewise linear by Proposition (5.1.6) since it can be written in the form

$$
f(x) = \min_{i=1,\ldots,k} \{-2 q_i \cdot x + \|q_i\|_2^2\} - \min_{i=1,\ldots,j} \{-2 p_i \cdot x + \|p_i\|_2^2\} .
\tag{5.2.3}
$$
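
The equivalence of the distance form and the piecewise linear form of the minimum distance discriminant can be checked directly, since the $\|x\|_2^2$ term is common to both minima and cancels. The sketch below uses hypothetical function names and made-up prototype points.

```python
def sq_dist(x, p):
    """Squared Euclidean distance between x and p."""
    return sum((xi - pi) ** 2 for xi, pi in zip(x, p))


def discriminant_distance_form(x, P, Q):
    """Squared distance to the nearest q minus squared distance to the nearest p."""
    return min(sq_dist(x, q) for q in Q) - min(sq_dist(x, p) for p in P)


def discriminant_linear_form(x, P, Q):
    """Same discriminant written as a difference of minima of linear functions of x."""
    def piece(r):
        return -2.0 * sum(ri * xi for ri, xi in zip(r, x)) + sum(ri * ri for ri in r)
    return min(piece(q) for q in Q) - min(piece(p) for p in P)


P = [(0.0, 0.0), (4.0, 0.0)]  # hypothetical class 1 prototypes
Q = [(2.0, 3.0)]              # hypothetical class 2 prototype
x = (1.0, 0.0)
print(discriminant_distance_form(x, P, Q), discriminant_linear_form(x, P, Q))
```

A positive value assigns $x$ to class $C_1$, whose nearest prototype is then closer than any class $C_2$ prototype.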

Minimum distance classifiers are particularly effective in situations where the patterns in each class cluster into isolated subclasses. If the clusters are sufficiently far apart, then a single prototype pattern selected from each subclass and included in the appropriate set $\mathcal{P}$ or $\mathcal{Q}$ will insure good performance of the discriminant on that subclass. Such multimodal behavior is sufficiently common that the problem of clustering multidimensional data has received much attention (e.g. [5], Ch. 6).

Even in the absence of such clustering behavior, a minimum distance discriminant can always be found that separates two finite, disjoint pattern sets $\mathcal{S}_1$ and $\mathcal{S}_2$. This follows immediately from the choice $\mathcal{P} = \mathcal{S}_1$, $\mathcal{Q} = \mathcal{S}_2$. When a minimum distance classifier uses prototype sets consisting of large numbers of known sample patterns from classes $C_1$ and $C_2$, respectively, the terminology "nearest neighbor rule" is often used to describe the classification procedure. Cover and Hart [23] show that if the known sample patterns are drawn from the same mixture distribution that produces the test patterns, the asymptotic error rate on new patterns as the number of known samples increases without bound is less than twice the error rate of the Bayes discriminant. However, this performance is achieved at the very considerable price of a large data storage requirement for the list of prototype patterns and the computational effort required to identify the nearest known sample to a test pattern.

Another example of a piecewise linear discriminant is found in the layered network of threshold logic units discussed in Section 2.2. Nilsson [24] shows that if there are $k$ TLU's in the first layer, then a layered machine implements a discriminant of the form

$$
f(x) = \max_{i=1,\ldots,j} \{f_i(x)\} - \max_{i=j+1,\ldots,2k} \{f_i(x)\}
\tag{5.2.4}
$$

where each $f_i(x)$ is linear.

In the next section piecewise linear discriminants of the form

$$ f(x) = \bigvee_{i=1}^{k} f_i(x) $$

where each $f_i(x)$ is linear are considered. Necessary and sufficient conditions for the existence of a discriminant of this type that separates two given finite pattern sets are developed. General applicability of this discriminant to arbitrary finite pattern sets is then demonstrated by use of a class of pattern space transformations.


5.3.  Convex  Separability 

Let $\mathcal{S}_1$, $\mathcal{S}_2$ be subsets of $\mathcal{D}$, where $\mathcal{D}$ is a convex subset of $\mathbb{R}^n$.

Definition (5.3.1). $\mathcal{S}_1$ is convex separable from $\mathcal{S}_2$ if there exists a continuous convex function $f : \mathcal{D} \to \mathbb{R}^1$ such that

$$
\begin{aligned}
f(x) &> 0 \quad \forall\, x \in \mathcal{S}_1 \\
f(x) &< 0 \quad \forall\, x \in \mathcal{S}_2
\end{aligned}
$$

PROPOSITION (5.3.2). Let $\mathcal{S}_1 = \{x_1, \ldots, x_k\}$ be a finite point set and let $\mathcal{S}_2$ be any subset of $\mathbb{R}^n$. If $\mathcal{S}_1$ is convex separable from $\mathcal{S}_2$ then there exists a convex piecewise linear separating function $f(x)$.

Proof. Let $g(x)$ be a continuous convex separating function and let $\mathcal{R} = \{x : g(x) < 0\}$. $\mathcal{R}$ is an open convex region whose closure $\bar{\mathcal{R}}$ contains $\mathcal{S}_2$ as a proper subset and does not intersect $\mathcal{S}_1$. Thus by the separating hyperplane theorem, for each $x_i \in \mathcal{S}_1$ there exists a hyperplane $w_i \cdot x = \theta_i$ which separates $x_i$ from $\bar{\mathcal{R}}$, i.e. $f_i(x) = w_i \cdot x - \theta_i$ is positive for $x = x_i$ and negative for all $x \in \bar{\mathcal{R}}$. Then $f(x) = \bigvee_{i=1}^{k} f_i(x)$ is a convex piecewise linear function that separates $\mathcal{S}_1$ from $\mathcal{S}_2$. □

Proposition (5.3.2) can be used to prove the following geometric criterion for convex separability of finite pattern sets. Let $C(\mathcal{S})$ denote the convex hull of $\mathcal{S}$.

PROPOSITION (5.3.3). Let $\mathcal{S}_1$, $\mathcal{S}_2$ be finite, disjoint pattern sets. Then $\mathcal{S}_1$ is convex separable from $\mathcal{S}_2$ iff $\mathcal{S}_1 \cap C(\mathcal{S}_2) = \emptyset$.

Proof. $C(\mathcal{S}_2)$ is a closed convex set. If $\mathcal{S}_1 \cap C(\mathcal{S}_2) = \emptyset$, then a convex piecewise linear separating function can be constructed as in the proof of Proposition (5.3.2). Conversely, if a continuous convex function $f$ separates $\mathcal{S}_1$ from $\mathcal{S}_2$, $f$ is strictly positive on $\mathcal{S}_1$ and strictly negative on $\mathcal{S}_2$. By convexity, $f$ is also strictly negative on $C(\mathcal{S}_2)$. Hence $\mathcal{S}_1 \cap C(\mathcal{S}_2) = \emptyset$. □

Figure (5.3.4) demonstrates that convex separability is not a symmetric relation between $\mathcal{S}_1$ and $\mathcal{S}_2$. Here $\mathcal{S}_1 \cap C(\mathcal{S}_2) = \emptyset$, but $C(\mathcal{S}_1) \cap \mathcal{S}_2 \ne \emptyset$.

Clearly not all disjoint pattern sets are convex separable. However, the following sufficient condition for convex separability motivates a class of coordinate transformations that render all finite disjoint pattern sets convex separable in the transformed space.

Figure (5.3.4). Class $C_1$ Patterns Are Convex Separable from Class $C_2$ Patterns But Not Conversely.

Figure (5.3.6). Finite Disjoint Pattern Sets on the Surface of a Sphere Are Always Convex Separable. The Convex Hull of the Class $C_1$ Patterns, Except for the Patterns Themselves, Lies Inside the Sphere and Thus Cannot Contain Any Class $C_2$ Patterns.


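PROPOSITION (5.3.5). Let $\mathcal{S}_1 = \{x_1, \ldots, x_k\}$, $\mathcal{S}_2 = \{x_{k+1}, \ldots, x_m\}$ be finite disjoint pattern sets and let $f(x)$ be a strictly convex function defined on $C(\mathcal{S}_1 \cup \mathcal{S}_2)$. If there exists a real constant $\alpha$ such that $f(x) = \alpha$ for all $x \in \mathcal{S}_1 \cup \mathcal{S}_2$, then $\mathcal{S}_1$ is convex separable from $\mathcal{S}_2$ and $\mathcal{S}_2$ is convex separable from $\mathcal{S}_1$.

Proof. Let $x \in C(\mathcal{S}_1)$. Then there exist non-negative constants $\lambda_1, \ldots, \lambda_k$ such that $\sum_{i=1}^{k} \lambda_i = 1$ and $x = \sum_{i=1}^{k} \lambda_i x_i$. By convexity of $f$,

$$ f(x) \le \sum_{i=1}^{k} \lambda_i f(x_i) = \alpha . $$

By strict convexity of $f$, $f(x) < \alpha$ if $x$ is not an extreme point of $C(\mathcal{S}_1)$, i.e. if $x \notin \mathcal{S}_1$. Since $\mathcal{S}_1$ and $\mathcal{S}_2$ are disjoint and $f(x) = \alpha$ for all $x \in \mathcal{S}_2$, $C(\mathcal{S}_1) \cap \mathcal{S}_2 = \emptyset$. By inverting the roles of $\mathcal{S}_1$ and $\mathcal{S}_2$ it follows also that $\mathcal{S}_1 \cap C(\mathcal{S}_2) = \emptyset$. □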


Geometrically this proposition states that two disjoint pattern sets distributed on the surface defined by the equation $f(x) = \alpha$, where $f(x)$ is strictly convex, are convex separable from each other. This follows from the fact that the convex hull of each set intersects the surface only at the points in the set itself. This is illustrated for the case of a sphere in Figure (5.3.6).



Example (5.3.7).

Disjoint binary pattern sets in $\mathbb{R}^n$, where each pattern component is either equal to $+1$ or $-1$, are convex separable since they satisfy the hypotheses of Proposition (5.3.5) with

$$ f(x) = \sum_{i=1}^{n} (x)_i^2 \quad \text{and} \quad \alpha = n . \qquad \square $$

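For two general finite disjoint pattern sets $\mathcal{S}_1$, $\mathcal{S}_2$ in $\mathbb{R}^n$, it is possible to define a one-to-one mapping into sets $\mathcal{S}_1'$, $\mathcal{S}_2'$ in $\mathbb{R}^{n+1}$ such that $\mathcal{S}_1'$ and $\mathcal{S}_2'$ are convex separable. Let $g : \mathbb{R}^n \to \mathbb{R}^1$ be a strictly convex function defined on $C(\mathcal{S}_1 \cup \mathcal{S}_2)$. Let $h : \mathbb{R}^1 \to \mathbb{R}^1$ be a strictly convex function with an inverse $h^{-1} : \mathbb{R}^1 \to \mathbb{R}^1$ defined on $\{\alpha - g(x) : x \in \mathcal{S}_1 \cup \mathcal{S}_2\}$ for some real constant $\alpha$. For each $x \in \mathcal{S}_1 \cup \mathcal{S}_2$, define the transformed pattern $y \in \mathbb{R}^{n+1}$ by

$$ y = (x,\ h^{-1}(\alpha - g(x))) \tag{5.3.8} $$

Let $\mathcal{S}_1'$, $\mathcal{S}_2'$ be the sets resulting from applying the transformation (5.3.8) to the patterns in $\mathcal{S}_1$ and $\mathcal{S}_2$, respectively.

PROPOSITION (5.3.9). The transformed pattern sets $\mathcal{S}_1'$, $\mathcal{S}_2'$ are convex separable from each other.

Proof. The function $f : \mathbb{R}^{n+1} \to \mathbb{R}^1$ defined by

$$ f(x, \beta) = g(x) + h(\beta), \qquad x \in \mathbb{R}^n,\ \beta \in \mathbb{R}^1 $$

is strictly convex. The sets $\mathcal{S}_1'$ and $\mathcal{S}_2'$ are disjoint and if $y \in \mathcal{S}_1' \cup \mathcal{S}_2'$,

$$
\begin{aligned}
f(y) &= f(x,\ h^{-1}(\alpha - g(x))) \\
&= g(x) + h(h^{-1}(\alpha - g(x))) \\
&= \alpha
\end{aligned}
$$

and the result follows from Proposition (5.3.5). □

The sets $\mathcal{S}_1'$, $\mathcal{S}_2'$ are formed by mapping the patterns in $\mathcal{S}_1$ and $\mathcal{S}_2$ onto the surface $f(y) = \alpha$ in a one-higher dimensional space. The following two examples provide pattern space transformations that are valid for all finite, disjoint pattern sets in $\mathbb{R}^n$.
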

Example (5.3.10).

Let $g(x) = \sum_{i=1}^{n} (x)_i^2$ and $h(\beta) = \beta^2$. Choose
$\alpha = \max_{i=1,\dots,k} g(x_i)$. Then

$$y = (x,\ \sqrt{\alpha - g(x)})$$

is the desired pattern space transformation. In this example the n-dimensional
patterns in $\mathcal{A}_1 \cup \mathcal{A}_2$ are mapped onto the surface of the
(n+1)-dimensional sphere of radius $\sqrt{\alpha}$ centered at the origin. □
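
The sphere transformation can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `lift_to_sphere` and the sample patterns are assumptions introduced here, not part of the original text.

```python
import math

def lift_to_sphere(patterns):
    """Map n-dimensional patterns onto a sphere in n+1 dimensions.

    Uses g(x) = sum of squared components and h(beta) = beta**2, so the
    appended coordinate is sqrt(alpha - g(x)) with alpha = max g(x_i).
    """
    g = [sum(c * c for c in x) for x in patterns]
    alpha = max(g)
    lifted = [tuple(x) + (math.sqrt(alpha - gx),)
              for x, gx in zip(patterns, g)]
    return lifted, alpha

# Hypothetical sample patterns in the plane.
patterns = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
lifted, alpha = lift_to_sphere(patterns)

# Every lifted pattern should have Euclidean norm sqrt(alpha).
radii = [math.sqrt(sum(c * c for c in y)) for y in lifted]
```

The first n coordinates of each pattern are left unchanged, so the mapping is one-to-one, and the pattern with the largest g(x) receives a zero in the appended coordinate.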

Example (5.3.11).

This example works for any strictly convex function $g(x)$, e.g.
$g(x) = xCx$ where $C$ is an $n \times n$ positive definite matrix. Choose
$h(\beta) = -\ln(\beta)$. Then the desired pattern space transformation is

$$y = (x,\ e^{g(x)})$$

The strictly convex function

$$f(x, \beta) = g(x) - \ln(\beta)$$

is equal to zero for all transformed patterns y. □
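
A minimal Python sketch of this logarithmic transformation follows. The helper names (`quad_form`, `lift_log`, `f`), the matrix `C` and the sample patterns are assumptions chosen for illustration.

```python
import math

def quad_form(x, C):
    """g(x) = x C x for a square matrix C given as nested lists."""
    n = len(x)
    return sum(x[i] * C[i][j] * x[j] for i in range(n) for j in range(n))

def lift_log(patterns, C):
    """Append the coordinate exp(g(x)) to each pattern."""
    return [tuple(x) + (math.exp(quad_form(x, C)),) for x in patterns]

def f(y, C):
    """f(x, beta) = g(x) - ln(beta), with beta the last coordinate of y."""
    *x, beta = y
    return quad_form(x, C) - math.log(beta)

C = [[2.0, 0.5], [0.5, 1.0]]   # illustrative positive definite matrix
patterns = [(0.0, 0.0), (1.0, -1.0), (0.5, 2.0)]

# f should vanish (up to rounding) on every transformed pattern.
residuals = [f(y, C) for y in lift_log(patterns, C)]
```

In floating point the appended coordinate grows as exp(g(x)), so patterns with large g(x) may need to be rescaled before the transformation is applied.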


In the next section an algorithm is presented that constructs a
convex piecewise linear discriminant by the method suggested in the
proof of Proposition (5.3.2). An arbitrary pattern is chosen from
$\mathcal{A}_1$ and then a linear discriminant separating this pattern from the
entire set $\mathcal{A}_2$ is found as a solution to a constrained LPD problem.
The problem is designed to encourage the separation of as many additional
patterns in $\mathcal{A}_1$ from $\mathcal{A}_2$ as possible along with the chosen
one. All patterns in $\mathcal{A}_1$ that are separated from $\mathcal{A}_2$ are then
dropped from $\mathcal{A}_1$ and the process is repeated with new linear
discriminants until $\mathcal{A}_1$ is empty.


5.4. An Algorithm for Convex Piecewise Linear Separation

Let $\mathcal{A}_1 = \{x_1, \dots, x_\ell\}$, $\mathcal{A}_2 = \{x_{\ell+1}, \dots, x_m\}$
be finite disjoint pattern sets such that $\mathcal{A}_1$ is known to be convex
separable from $\mathcal{A}_2$ (e.g. the patterns are binary or have undergone the
transformation described in Section 5.3). An algorithm is now presented that
determines a convex piecewise linear separating function.

Let

    $k$ = iteration number

    $\mathcal{A}_1^{(k)}$ = set of patterns in $\mathcal{A}_1$ not yet separated from
    $\mathcal{A}_2$ before the kth iteration.

    $x^{(k)}$ = selected element of $\mathcal{A}_1^{(k)}$.

    $A_1^{(k)}$ = signed augmented pattern matrix corresponding to $\mathcal{A}_1^{(k)}$.

    $A_2$ = signed augmented pattern matrix corresponding to $\mathcal{A}_2$.

    $a^{(k)}$ = signed augmented pattern corresponding to $x^{(k)}$.


ALGORITHM (5.4.1).

Step 1.  Set $k = 1$ and $\mathcal{A}_1^{(1)} = \mathcal{A}_1$. Go to Step 2.


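Step 2.  Choose an arbitrary pattern $x^{(k)} \in \mathcal{A}_1^{(k)}$.
Form the matrix $A_1^{(k)}$ and solve the constrained LPD problem

(5.4.2)
$$\begin{aligned}
\min\ \ & es \\
\text{s.t.}\ \ & A_1^{(k)} u + Is \ge e \\
& a^{(k)} u \ge 1 \\
& A_2 u \ge e \\
& s \ge 0 \\
& u = (w, \theta) \in \mathbb{R}^{n+1}
\end{aligned}$$

Let $u^{(k)} = (w^{(k)}, \theta^{(k)})$ be an optimal solution to (5.4.2).
Go to Step 3.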


Step 3.  Set

$$\mathcal{A}_1^{(k+1)} = \{x_i \in \mathcal{A}_1^{(k)} : w^{(k)} x_i - \theta^{(k)} \le 0\}.$$

If $\mathcal{A}_1^{(k+1)} = \emptyset$, go to Step 4; otherwise increment k
by 1 and go to Step 2.

Step 4.  Stop. Let k* be the final value of k. Then the desired
convex piecewise linear separating function is

$$f^*(x) = \max_{i=1,\dots,k^*} \left( w^{(i)} x - \theta^{(i)} \right).$$
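
The control flow of the algorithm can be sketched in Python as follows. This is a sketch under stated assumptions, not the method as implemented in the text: the LP solve in Step 2 is replaced by a hypothetical stand-in, `tangent_cut`, which returns a feasible point of the constraints for patterns that have been lifted onto a sphere but does not minimize the LPD objective. All function names and the sample data are introduced here for illustration.

```python
import math

def lift_to_sphere(patterns):
    """Sphere lift: append sqrt(alpha - |x|^2), alpha = max |x_i|^2."""
    g = [sum(c * c for c in x) for x in patterns]
    alpha = max(g)
    return [tuple(x) + (math.sqrt(alpha - gx),)
            for x, gx in zip(patterns, g)], alpha

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def tangent_cut(y_k, set2, alpha):
    """Stand-in for the LP solve in Step 2 (NOT an optimal LPD solution).

    For distinct patterns on the sphere |y|^2 = alpha, y_k . y < alpha for
    every other pattern y, so a plane parallel to the tangent plane at y_k
    can be scaled to give w.y_k - theta = 1 and w.y - theta <= -1 on set2.
    """
    worst = max(dot(y_k, y) for y in set2)
    half_gap = (alpha - worst) / 2.0
    w = tuple(c / half_gap for c in y_k)
    theta = (alpha + worst) / (2.0 * half_gap)
    return w, theta

def convex_pwl_separate(set1, set2, solve_cut):
    """Outer loop of the algorithm; returns the list of (w, theta) pieces."""
    remaining = list(set1)                # Step 1: start from all of set 1
    pieces = []
    while remaining:
        x_k = remaining[0]                # Step 2: an arbitrary pattern
        w, theta = solve_cut(x_k, remaining, set2)
        pieces.append((w, theta))
        # Step 3: keep only patterns not yet on the positive side.
        remaining = [x for x in remaining if dot(w, x) - theta <= 0.0]
    return pieces

def f_star(x, pieces):
    """Step 4: pointwise maximum of the linear pieces."""
    return max(dot(w, x) - theta for w, theta in pieces)

# Interleaved points on a line: (1,1) lies between the two class-2 points,
# so no convex function separates the classes before the lift.
class1 = [(1.0, 1.0), (3.0, 3.0)]
class2 = [(0.0, 0.0), (2.0, 2.0)]
lifted, alpha = lift_to_sphere(class1 + class2)
set1, set2 = lifted[:2], lifted[2:]
pieces = convex_pwl_separate(
    set1, set2, lambda y_k, rem, s2: tangent_cut(y_k, s2, alpha))
```

Because the stand-in ignores the remaining patterns, it typically removes only one pattern per iteration. Solving (5.4.2) to optimality in `solve_cut` is what lets a single hyperplane remove several patterns at once.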


Proof of Algorithm:

Since $\mathcal{A}_1$ is assumed convex separable from $\mathcal{A}_2$, each
individual pattern in $\mathcal{A}_1$ is linearly separable from $\mathcal{A}_2$.
Thus the inequality system

(5.4.3)    $a^{(k)} u \ge 1, \qquad A_2 u \ge e$

is feasible and hence an optimal solution $u^{(k)}$ to (5.4.2) exists by
Proposition (3.5.11). Since $w^{(k)} x^{(k)} - \theta^{(k)} \ge 1$,
$\mathcal{A}_1^{(k+1)}$ is smaller than $\mathcal{A}_1^{(k)}$ by at least one element
for all $k < k^*$. Thus the algorithm must terminate in at most $\ell$
iterations. For each $x_i \in \mathcal{A}_1$, there is at least one value of k such
that $w^{(k)} x_i - \theta^{(k)} > 0$. Hence $f^*(x) > 0$ for all
$x \in \mathcal{A}_1$. Also, since $A_2 u^{(k)} \ge e$ for $k = 1, \dots, k^*$,
$f^*(x) < 0$ for all $x \in \mathcal{A}_2$. □


The linear program (5.4.2) produces a hyperplane that minimizes
the sum of the infeasibilities corresponding to the remaining class $C_1$
patterns, subject to the constraint that all class $C_2$ patterns and a
specified $C_1$ pattern are on the 'correct' side of their respective
margin planes. The intent of this LPD form of the objective function is to
encourage the optimal hyperplane to separate other class $C_1$ patterns
in addition to the specified one at each iteration whenever possible.
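
To make the structure of (5.4.2) concrete, the sketch below assembles it in the inequality form "minimize c.z subject to A_ub z <= b_ub" used by general-purpose LP solvers. It assumes, consistently with the inequalities in the proof above, that the signed augmented pattern is (x, -1) for class $C_1$ and (-x, +1) for class $C_2$. The function name, the argument names and the `c2_margin` parameter are hypothetical, and a general LP solver is a substitute for the ALPD algorithm used in the text.

```python
def build_lpd_cut(remaining, chosen, set2, c2_margin=1.0):
    """Assemble (5.4.2) with variables z = (w, theta, s).

    Assumption: signed augmented patterns are (x, -1) for class C_1 and
    (-x, +1) for class C_2, so that a.u = w.x - theta for u = (w, theta).
    c2_margin is the right-hand side of the class C_2 rows (1 in (5.4.2)).
    """
    n, p = len(chosen), len(remaining)
    c = [0.0] * (n + 1) + [1.0] * p              # objective: sum of s
    A_ub, b_ub = [], []
    for i, x in enumerate(remaining):            # w.x - theta + s_i >= 1
        slack = [0.0] * p
        slack[i] = -1.0
        A_ub.append([-v for v in x] + [1.0] + slack)
        b_ub.append(-1.0)
    # Chosen pattern: w.x_k - theta >= 1, with no slack allowed.
    A_ub.append([-v for v in chosen] + [1.0] + [0.0] * p)
    b_ub.append(-1.0)
    for x in set2:                               # w.x - theta <= -margin
        A_ub.append(list(x) + [-1.0] + [0.0] * p)
        b_ub.append(-c2_margin)
    # u = (w, theta) is free; the slack variables s are nonnegative.
    bounds = [(None, None)] * (n + 1) + [(0.0, None)] * p
    return c, A_ub, b_ub, bounds

# One-dimensional illustration: class C_1 at 2 and 3, class C_2 at 0.
c, A_ub, b_ub, bounds = build_lpd_cut(
    remaining=[(2.0,), (3.0,)], chosen=(2.0,), set2=[(0.0,)])

# z = (w, theta, s_1, s_2) = (1, 1, 0, 0) is feasible with zero cost.
z = [1.0, 1.0, 0.0, 0.0]
lhs = [sum(a * v for a, v in zip(row, z)) for row in A_ub]
cost = sum(ci * zi for ci, zi in zip(c, z))
```

The four returned objects match the argument layout of `scipy.optimize.linprog(c, A_ub=..., b_ub=..., bounds=...)`, which is one way such a problem could be solved if SciPy is available.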

Toward this end it has been found that for several test problems of the
overlapping hypercube type discussed in Section (4.6), replacement of
the constraints $A_2 u \ge e$ in (5.4.2) with $A_2 u \ge \varepsilon e$, where
$\varepsilon$ is a very small positive number, often reduces the total number of
iterations required. In effect, this change eliminates the margin problem for
the class $C_2$ patterns and forces the optimal hyperplane to pass very close
to the convex hull of $\mathcal{A}_2$. Thus for sufficiently small values of
$\varepsilon$, the possibility of a class $C_1$ pattern lying between the optimal
hyperplane and this convex hull is eliminated. Numerical experience with
this revised form of the algorithm suggests that when the selected
class $C_1$ pattern is part of a cluster of $C_1$ patterns that are
linearly separable from $\mathcal{A}_2$, all or nearly all of the cluster is
separated by the optimal hyperplane. The following example illustrates
this behavior.

Example (5.4.4)

The overlapping hypercube problem discussed in Section (4.6)
was selected as a test case. A total of m = 200 patterns of dimension
n = 2 were generated, half in each class. The two unit squares overlapped
on an area of a = 0.20. To introduce convex separability, the patterns
were mapped onto the surface of a three-dimensional sphere by the
transformation given in Example (5.3.10). The resultant three-dimensional
patterns were separated by a convex function generated by the revised
version of Algorithm (5.4.1). The constrained LPD problems were solved
by the ALPD algorithm after conversion to a weighted LPD format. The
separation sequence is shown in Table (5.4.5). The problem required a
total of 11 iterations for complete separation. The first iteration
hyperplane succeeded in separating a large cluster of 78 class $C_1$
patterns, while subsequent hyperplanes separated either isolated patterns
or small clusters. This behavior is consistent with the geometry of the
problem. In the original pattern space ($\mathbb{R}^2$), the 20% overlap factor
implies that a large fraction of the class $C_1$ patterns should be
linearly separable from $\mathcal{A}_2$. Since the mapping of the patterns onto
the sphere in $\mathbb{R}^3$ leaves the first two coordinates intact, linear
separability of these patterns is preserved. The remaining class $C_1$
patterns in $\mathbb{R}^2$ are uniformly distributed in or near the area of
overlap. Thus the transformed patterns in $\mathbb{R}^3$ from the overlap area
in $\mathbb{R}^2$ are expected to show little tendency to cluster by class,
with only small linearly separable clusters of class $C_1$ patterns
being formed by chance. □
Iteration    Remaining Class    Number of Class
Number       C_1 Patterns       C_1 Patterns Separated

    1             100                  78
    2              22                   3
    3              19                   1
    4              18                   6
    5              12                   1
    6              11                   2
    7               9                   1
    8               8                   2
    9               6                   1
   10               5                   3
   11               2                   2

Table (5.4.5). Separation sequence of convex separation algorithm
in Example (5.4.4)
Problems in pattern recognition are treated by the methods of mathematical
programming. In particular the two-class pattern classification model with
decision rules based on discriminant functions is considered with emphasis
on mathematical programs that determine linear and piecewise linear
discriminants.


For linearly separable pattern sets a separating hyperplane can be determined
by solving a system of linear inequalities. This system serves as the
constraint set for a class of mathematical programs that define separating
linear discriminants exhibiting maximum tolerance to pattern noise. Specific
cases that can be modelled as linear and quadratic programs are discussed
and a reliability interpretation of the objective criterion is given.

Application of linear discriminants to the linearly inseparable case leads to
consideration of solution concepts for possibly infeasible linear inequality
systems. The Least Positive Deviations (LPD) solution to the general system
$Ax \ge b$, where $A$ is an $(m \times n)$ matrix with $x \in \mathbb{R}^n$ and
$b \in \mathbb{R}^m$, is defined by a Phase I linear programming model. An
equivalent unconstrained minimization problem with a piecewise linear objective
serves as the basis for the development of the Accelerated Least Positive
Deviations (ALPD) algorithm for the solution of the model. The algorithm is
shown to be implementable by a sequence of pivot operations of the same type as
employed by the simplex method with upper bounds applied to the dual of the
Phase I problem but with a novel pivot selection rule and without regard to the
upper bounds. At each iteration the pivot selection is determined by the
solution to an unconstrained minimization of a piecewise linear function of a
one-dimensional variable. Like the simplex method, the ALPD algorithm converges
in a finite number of iterations to an optimal solution. A direct comparison of
the relative efficiencies of the simplex and ALPD algorithms can be made in
terms of the number of basis changes required to reach optimality from the same
arbitrary initial basis. Results of an extensive series of numerical tests
are reported which indicate a large ALPD advantage for linearly inseparable
classification problems. The advantage appears to increase with the aspect
ratio (m/n) of the matrix A and the degree of infeasibility of the
underlying inequality system.

The LPD problem is generalized to the weighted and constrained weighted least
deviations problems, which are shown to be directly solvable by the ALPD
algorithm. Examples of such problems are presented from linear estimation and
control theory. The general linear programming problem is also formulated as
a constrained weighted least deviations model. Properties of LPD and related
models are explored for classification problems and an asymptotic LPD
discriminant characterization is obtained.

The LPD methodology is extended to piecewise linear discriminants. A class of
pattern space transformations is defined that renders any pair of finite
disjoint pattern sets separable by a convex piecewise linear function. An
algorithm is presented that constructs such a function through the solution
of a sequence of constrained weighted least deviations problems. Results of
a numerical test problem are presented.